Preface


The ocean covers about 71% of Earth’s surface, yet many aspects of it remain
a mystery. Understanding ocean circulation, biogeochemical cycles, and marine
resources directly affects human activities in the 21st century. Over the past four
decades, the growing use of satellites and autonomous observation platforms has
yielded in-situ and remotely sensed data at high spatial and temporal resolutions,
ushering in the Big Ocean Data Era. However, the human capacity to filter,
curate, and analyze these data is limited. In the era of big data, efficiently extracting
useful information from massive datasets has become a new challenge in oceanographic
research.

Artificial intelligence technology has been applied across scientific domains and
disciplines with tremendous success. For example, machine learning approaches
have been widely used in computer vision, medicine, and geophysics. Machine
learning is an application of artificial intelligence that aims to enable systems to
learn automatically from experience without human intervention. With the rapid
increase in computing power in recent years, deep learning, a more advanced
machine learning technology, has begun to show its power in solving very complex,
nonlinear, high-dimensional problems. Promisingly, these artificial intelligence
approaches also have enormous potential to improve the quality and extent of
ocean research by identifying latent patterns and hidden trends, particularly in large
datasets that are intractable with traditional methods. In addition, the new
data-driven and learning-based methodologies may offer novel, computationally
efficient strategies to improve oceanographic research.
This book brings together state-of-the-art studies on the broad theme of artificial
intelligence applications in oceanography, including purely data-driven forecasts,
dataset reconstruction, and detection or extraction of oceanic features from remote
sensing imagery. The contributions demonstrate the tremendous potential of
artificial intelligence technology to contribute to rapid advances in ocean
science and may inspire readers from related disciplines. As the editors of this book, we
would like to thank all the contributors for their fruitful cooperation and
Dr. Shuangshang Zhang for editorial assistance.


Qingdao, China Xiaofeng Li 
December 2021 Fan Wang 
1 The Development of Artificial Intelligence


Artificial intelligence (AI) is the core driver of the fourth technological revolution,
following the revolutions in steam technology, electricity technology, and computers
and information technology. Since its emergence in the 1950s, AI has greatly improved
productivity and has changed the structure and relations of production.
Understanding the history of AI is indispensable for subsequent research
and for the development of AI technologies. AI can be divided into three generations
according to what drives it. The following subsections introduce
each of these three generations.


1.1 The First-Generation AI


Turing proposed the “Turing Test” in 1950 [49]. It states that if a machine can answer
a series of questions posed by a human tester within five minutes, and more than
30% of its answers deceive the tester into thinking that they were given by
a human, then the machine can be considered intelligent. In the same year, Turing
predicted the feasibility of intelligent machines. The “Turing Test” can be seen as the
genesis of AI. Newell and Simon [41] developed the first heuristic program in the
world, the Logic Theorist. It successfully proved 38 theorems from the book “Principia
Mathematica” by simulating human thinking activities. This program successfully
demonstrated the feasibility of the predictions posed by Turing, and it is considered
the first successful AI program. In August of the same year, the concept of “artificial
intelligence” was first introduced by John McCarthy, Herbert Simon, and a group
of scientists from different fields at Dartmouth College. Thus, AI stepped onto the
stage of history as an independent discipline. Newell and Shaw [40] invented the
first AI programming language, the Information Processing Language (IPL). It used
symbols as basic elements and proposed a reference table structure instead of storing
addresses or arrays. McCarthy [36] developed a list processing language (LISP) based on
the IPL, which was widely used in the AI community.

Fig. 1 The research field of the first generation of artificial intelligence. The first generation of AI is
knowledge-driven AI, which mainly conducts research on knowledge representation and reasoning.
Production rules, predicate logic, semantic networks and knowledge graphs are common knowledge
representations. Inductive reasoning, deductive reasoning, monotonic reasoning and deterministic
reasoning are mainstream reasoning methods

The first generation of AI is known as knowledge-driven AI; such systems allow
machines to learn by imitating the process of human reasoning and thinking. As
shown in Fig. 1, the core steps can be divided into two parts: knowledge representation
and knowledge reasoning.

Knowledge representation is required to allow machines to achieve intelligent
behavior. It represents human-understood knowledge in a certain data structure that
allows machines to understand and process it. The methods of knowledge
representation include predicate logic, production rules, semantic network
representation, and knowledge graphs.

Predicate logic can describe how the human mind works. Among logical systems,
propositional logic is the simplest; it describes declarative
sentences using truth values. For example, in “the sea is blue”, the “true” or “false”
of the proposition is called its truth value. Propositions can be divided into atomic
propositions and compound propositions. An atomic proposition is a proposition that
cannot be split into simpler declarative sentences. Compound propositions are
decomposable propositions consisting of atomic propositions and connectives. However,
both have limited expressive power and can only represent established facts. Thus,
predicate logic was developed on the basis of propositional logic. It uses connectives
and quantifiers to describe objects, and predicates on objects to represent the world.
Predicates on objects refer to the properties of objects or the relationships between objects.
A predicate expression is built from constant symbols, predicate symbols, and function symbols.
Constant symbols represent objects, predicate symbols represent relationships
or attributes, and function symbols map objects to objects. For example, the constant
symbol Susan, the function symbol mother, and the predicate symbol nurse compose
the expression “nurse(mother(Susan))”, which indicates that the mother of Susan is a nurse.
Predicate logic representation has certain advantages: naturalness, accuracy, rigor,
and ease of implementation. However, it cannot represent uncertain knowledge, and
when it is used to describe too many things, it becomes inefficient.

Production rules are used to describe the cause-effect relationships between things.
For example, if an animal is a mammal and has a long trunk, then the animal is an
elephant. A production system consists of a rule base, a comprehensive database, and
a control system. The rule base holds a set of rules for inferring conclusions
from premises. The database stores known conditions, intermediate
results, and final conclusions. The control system selects suitable rules
from the rule base for inference.
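
The three components described above can be sketched in a few lines of Python. This is only an illustrative sketch: the rule format, the fact names (such as has_hair) and the function forward_chain are assumptions made for this example and do not come from any particular expert system.

```python
# Minimal forward-chaining production system (illustrative sketch only).
# The rule format and the fact names are assumptions made for this example.

RULES = [
    # (set of premises, conclusion)
    ({"has_hair"}, "is_mammal"),
    ({"is_mammal", "has_long_trunk"}, "is_elephant"),
]


def forward_chain(facts, rules):
    """Fire rules until no rule adds a new fact; return the final fact set."""
    database = set(facts)  # known conditions plus intermediate results
    changed = True
    while changed:  # control system: keep scanning the rule base
        changed = False
        for premises, conclusion in rules:
            if premises <= database and conclusion not in database:
                database.add(conclusion)  # store the new conclusion
                changed = True
    return database


result = forward_chain({"has_hair", "has_long_trunk"}, RULES)
print(sorted(result))
```

Here the list RULES plays the role of the rule base, the set database plays the role of the comprehensive database, and the loop that repeatedly scans the rules is a very simple control system.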

Semantic network representation is a network graph that represents knowledge
through entities and their semantic relationships. It consists of nodes and arcs. Nodes
represent entities, which are used to describe various things, concepts, situations,
attributes, states, events, actions, etc., and arcs represent semantic relations, such as
instance relations, classification relations, membership relations, attribute
relations, inclusion relations, temporal relations, location relations, etc. These basic units
are interconnected to form a semantic network.
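
As a rough illustration, a semantic network can be stored as a list of labelled arcs between nodes. The entities, the relation names (is_a, lives_in) and the helper functions below are hypothetical and chosen only for this sketch.

```python
# A semantic network stored as (head, relation, tail) arcs (illustrative).
# Node and relation names are assumptions made for this example.
ARCS = [
    ("dolphin", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("dolphin", "lives_in", "ocean"),
]


def neighbors(node, relation, arcs):
    """Return all nodes reached from `node` by one arc labelled `relation`."""
    return [tail for head, rel, tail in arcs if head == node and rel == relation]


def ancestors(node, arcs):
    """Follow 'is_a' arcs transitively to collect every superclass of `node`."""
    found = []
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent in neighbors(current, "is_a", arcs):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


print(ancestors("dolphin", ARCS))
```

Following the classification arcs transitively is a simple form of inheritance reasoning over the network: the dolphin node inherits membership in every class above it.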

A knowledge graph is essentially a semantic network that reveals the relationships
between entities, allowing a formal description of entities and their
interrelationships in the objective world. The study of knowledge graphs originated from the
Semantic Web. Tim Berners-Lee [2] proposed the concept of the Semantic Web at the
XML Conference in 2000, expecting to provide services such as information proxy,
search proxy, and information filtering by adding semantics to web pages. In 2005,
Metaweb was established in the United States to develop an open knowledge base
for web semantic services. It extracted real-world entities (people or things)
and the relationships between them from public datasets such as Wikipedia and the
United States Securities and Exchange Commission (SEC), and then stored them as
a graph structure on a computer. In 2010, Google acquired Metaweb and, with it,
its semantic search technology. In 2012, Google formally put forward the concept
of a knowledge graph, aiming to improve the capability of the search engine and
enhance the search experience of users based on knowledge graphs.

Reasoning is a form of thinking that logically derives new conclusions from known
premises. Knowledge reasoning is the process of using knowledge, and it is
one of the core issues in AI research. It uses existing knowledge to derive conclusions
and solve the corresponding problems. Depending on how the conclusion is
derived, reasoning can be divided into deductive reasoning and inductive reasoning:
deductive reasoning proceeds from the general to the specific, whereas
inductive reasoning proceeds from the specific to the general. Reasoning can
also be divided into monotonic and nonmonotonic reasoning, depending on whether
the set of conclusions derived in the reasoning process grows monotonically. Monotonic
means that the set of propositions known to be true never shrinks as the
reasoning progresses; conclusions, once derived, are not retracted. According to the
certainty of reasoning, reasoning can also
be divided into deterministic and non-deterministic reasoning, where deterministic
means that the knowledge used in reasoning and the conclusions derived are either
true or false, whereas non-deterministic means that the knowledge used in reasoning
and the conclusions derived are probabilistic.

In the era of knowledge-driven AI, the emergence of expert systems brought
AI into a period of vigorous development. An expert system is an intelligent computer
program that introduces the knowledge of a specialized field. Through knowledge 
representation and reasoning, it can simulate the decision-making process of human 
experts to solve the problems in this field and provide suggestions for the users. The 
first expert system was DENDRAL [4], developed by Feigenbaum in 1968. It was 
used to analyze the molecular structure of organic compounds by mass spectrometry. 
In the 1970s, the idea of expert systems was gradually accepted. A series of expert 
systems were developed to solve problems in different fields at that time, such as 
MYCIN [46] for diagnosis and treatment of blood infection diseases, MACSYMA
[35] for symbolic integration and theorem proving, and PROSPECTOR [14] for
mineral exploration. Subsequently, the application field of expert systems expanded
rapidly, and the difficulty of dealing with problems increased continuously. Several 
tool systems for building and maintaining expert systems have been developed. In 
the 1980s, the development of expert systems gradually became specialized, creating 
huge economic benefits. 

Although scientists at that time had great expectations for knowledge-driven AI, 
there were some fundamental problems in its development. The first problem was the 
interaction problem. Traditional methods could only simulate the thinking process 
of human beings, but could not simulate the complex interactions between humans 
and the environment. The second problem was the expansion problem. Traditional 
methods were only applicable to the development of expert systems in specific fields, 
and could not be extended to complex systems with larger scales and wider fields. 
The third problem was the application problem. Research on traditional methods 
was detached from the mainstream computing (software and hardware) environment, 
which seriously hindered the practical application of expert systems. Constrained by
the above-mentioned problems, the first generation of AI eventually declined.
1.2 The Second-Generation AI


First-generation AI is based on symbolism, which holds that sensory information is
expressed through some form of encoding. Second-generation AI instead establishes
stimulus-response connections in a neural network, so that intelligent behavior
emerges from the connections. Figure 2 illustrates the development of
second-generation AI.

In 1958, Rosenblatt established the prototype of an artificial neural network, the
perceptron, which followed the idea of connectionism. The perceptron was inspired
by two ideas. The first was the mathematical model of the neuron proposed by McCulloch
and Pitts in 1943 [37]: the threshold logic circuit, which converts the input of
a neuron into discrete values. The second was the Hebbian learning rule proposed
by D. O. Hebb in 1949 [23], that is, neurons that fire at the same time become connected.

In 1969, Minsky and Papert [38] pointed out that the perceptron could only solve
linearly separable problems. In addition, it was not practical when hidden layers
were added, as it lacked an effective learning algorithm. This criticism
of the perceptron proved to be fatal; as a result, second-generation AI
declined for more than 10 years.
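
The limitation to linearly separable problems can be reproduced with a very small experiment. The sketch below trains a single perceptron with the classic error-correction rule on the logical AND function, which is linearly separable, and on XOR, which is not. The function names, the learning rate and the number of epochs are assumptions made for this illustration.

```python
def predict(w, b, x1, x2):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron error-correction rule on 2-D inputs with labels in {0, 1}."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            err = target - predict(w, b, x1, x2)
            w[0] += lr * err * x1  # weights move only when the output is wrong
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples classified correctly."""
    hits = sum(1 for (x1, x2), t in samples if predict(w, b, x1, x2) == t)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and = train_perceptron(AND)
w_xor, b_xor = train_perceptron(XOR)
print("AND accuracy:", accuracy(AND, w_and, b_and))
print("XOR accuracy:", accuracy(XOR, w_xor, b_xor))
```

On AND the weights settle on a separating line after a few passes, whereas on XOR no single line separates the two classes, so the accuracy stays below 1 no matter how long training continues.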

In response to these difficulties, and through the joint efforts of many scholars,
significant progress has been made in both neural network models and learning algorithms.
In addition, mature theories and technologies have gradually formed over the past
30 years.

For example, the gradient descent method was proposed by the French mathematician
Cauchy [5]. The method finds a minimum by moving step by step along the
direction of the negative gradient; moving along the gradient itself (gradient ascent)
finds a maximum instead.

Another example is the back-propagation (BP) algorithm [43]. The algorithm
consists of a forward propagation process and a BP process. In the forward
propagation process, the input information enters the hidden layer through the input layer.
Then, the information is processed layer by layer and passed to the output layer.
If the desired output value is not obtained in the output layer, the sum of the
squares of the errors between the output values and the expected values is taken as the
objective function. In the BP process, the partial derivative of the objective function
with respect to each weight is obtained layer by layer and serves as the basis for modifying
that weight. The network learns through this weight modification
process. When the error falls to the expected level, learning ends.

Fig. 2 The development history of the second generation of artificial intelligence (timeline:
1958 perceptron; 1969 XOR problem; 1986 back propagation; 1998 LeNet-5; 2006 deep belief net;
2012 AlexNet; 2016 AlphaGo). Second-generation AI has developed since the advent of the
perceptron in 1958. With the birth of AlphaGo in 2016, it entered a period of rapid development

Regarding the loss function, a series of improvements have been made, such
as the cross-entropy loss function [28]. Cross-entropy is an important concept in
Shannon’s information theory and is mainly used to measure the difference
between two probability distributions. The performance of a language model
is typically measured using cross-entropy and perplexity. Cross-entropy reflects
how difficult the text is for the model to predict or, from a compression viewpoint,
the average number of bits needed to encode each word. Perplexity is the average
number of branches (equally likely next words) that the model faces at each position
in the text. Cross-entropy was introduced to the neural network field as a loss
function. We use p to represent the distribution of the true labels and q to
represent the label distribution predicted by the trained model. The cross-entropy
loss function measures the dissimilarity between p and q: the smaller it is, the
closer q is to p.
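
A small numerical sketch may make the relation between cross-entropy and perplexity concrete. It assumes the standard definitions H(p, q) = -sum_i p_i log2 q_i (in bits) and perplexity = 2^H; the distributions p, q_good and q_uniform are made-up values for illustration.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), in bits; terms with p_i == 0 add 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(p, q):
    """Perplexity is 2 raised to the cross-entropy measured in bits."""
    return 2.0 ** cross_entropy(p, q)


p = [1.0, 0.0, 0.0, 0.0]          # one-hot distribution of the true label
q_good = [0.7, 0.1, 0.1, 0.1]     # model that favours the correct class
q_uniform = [0.25, 0.25, 0.25, 0.25]  # model that knows nothing

print(cross_entropy(p, q_good), perplexity(p, q_good))
print(cross_entropy(p, q_uniform), perplexity(p, q_uniform))
```

For the uniform prediction over four classes the cross-entropy is 2 bits and the perplexity is 4, that is, the model is effectively choosing among four equally likely branches. A prediction that concentrates probability on the correct class gives lower values of both.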

Algorithm improvements, such as regularization methods, prevent over-fitting
[52]. Regularization involves imposing constraints while minimizing the empirical error
function. Such constraints act as prior distributions on the parameters and have a
guiding effect. When optimizing the error function, the search tends to choose descent
directions that also satisfy the constraints; hence, the final solution tends to
conform to prior knowledge. At the same time, regularization addresses the ill-posedness
of the inverse problem: the resulting solution exists, is unique, and depends continuously
on the data, so the influence of noise on it is weak. If the regularization is appropriate,
the solution will not overfit, even if the number of uncorrelated samples in the training
set is small.
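
One common instance of such a constraint is an L2 penalty on the weights, which corresponds to a zero-mean Gaussian prior on them. The sketch below uses the simplest possible case, a one-parameter least-squares fit, where the penalized problem has a closed-form solution. The function name ridge_slope and the data are assumptions made for this example; the chapter does not prescribe a specific regularizer.

```python
def ridge_slope(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2 over a single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]              # generated by y = 2x
print(ridge_slope(xs, ys, 0.0))   # unregularised least squares
print(ridge_slope(xs, ys, 14.0))  # penalty pulls the weight toward 0

# Degenerate data: every x is 0, so the unregularised problem has no unique
# solution (division by zero), while lam > 0 still yields a unique answer.
print(ridge_slope([0.0, 0.0], [1.0, 2.0], 1.0))
```

The penalty shrinks the weight toward the value favoured by the prior (zero), and it also makes the degenerate problem in the last line well posed, which mirrors the two roles of regularization described above.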

New network architectures have been developed, such as convolutional neural
networks (CNNs) [13], recurrent neural networks (RNNs) [33], long short-term memory
(LSTM) neural networks [25], and deep belief networks (DBNs) [24].

CNNs are a type of feedforward neural network (FNN) that includes convolution
calculations and has a deep structure. A CNN is capable of representation
learning and can perform shift-invariant classification of the input information
according to its hierarchical structure. A CNN is constructed by imitating the
biological visual perception mechanism and can perform both supervised and
unsupervised learning. The sharing of convolution kernel parameters within a hidden
layer and the sparsity of inter-layer connections enable the CNN to extract features
with less computation.
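
Parameter sharing and the resulting behavior under shifts can be seen in a one-dimensional toy example. The sketch below is not a CNN; it only applies one shared two-element kernel along a signal. The names conv1d and EDGE and the signals are assumptions made for this illustration.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product: the same kernel is reused everywhere."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


EDGE = [-1, 1]                      # shared kernel: responds to rising edges
x = [0, 0, 1, 1, 0, 0, 0]
x_shifted = [0, 0, 0, 1, 1, 0, 0]   # same pattern, moved one step right

y = conv1d(x, EDGE)
y_shifted = conv1d(x_shifted, EDGE)
print(y)
print(y_shifted)
print(max(y), max(y_shifted))       # a global max pool ignores the shift
```

Only two kernel weights are learned regardless of the signal length, and each output depends on just two neighboring inputs. Shifting the input shifts the feature map by the same amount, and taking the maximum over the map gives a response that does not depend on where the pattern occurred.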

An RNN is a type of recursive neural network in which all recurrent units (nodes) are
connected in a chain. RNNs are applied in natural language processing (NLP)
tasks such as speech recognition, language modeling, and machine translation.
An RNN can also be combined with convolution operations to handle computer
vision problems.

The LSTM network is a temporal RNN specifically designed to solve the
long-term dependence problem of general RNNs. When receiving new input
information, the network first forgets the long-term information that it no longer requires.
Afterward, it learns which parts of the new input are useful and saves
them in long-term memory. Finally, the network learns which parts of the long-term
memory are relevant to the current output.
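
The three stages described above correspond to the forget, input and output gates of the standard LSTM cell. The following sketch implements one time step for a cell with a single scalar state. The parameter layout (a dictionary mapping each gate name to a triple of input weight, recurrent weight and bias) and the numeric values are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps a gate name to (w_x, w_h, bias); this layout is hypothetical.
    """
    def gate(name, squash):
        w_x, w_h, bias = params[name]
        return squash(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)       # how much old memory to keep
    i = gate("input", sigmoid)        # how much new information to store
    g = gate("candidate", math.tanh)  # the new information itself
    o = gate("output", sigmoid)       # how much memory to expose now
    c = f * c_prev + i * g            # updated long-term memory
    h = o * math.tanh(c)              # output (short-term state)
    return h, c


keep = {"forget": (0, 0, 20), "input": (0, 0, -20),
        "candidate": (1, 0, 0), "output": (0, 0, 20)}
overwrite = {"forget": (0, 0, -20), "input": (0, 0, 20),
             "candidate": (1, 0, 0), "output": (0, 0, 20)}

h_keep, c_keep = lstm_step(1.0, 0.0, 0.5, keep)
h_over, c_over = lstm_step(1.0, 0.0, 0.5, overwrite)
print(c_keep, c_over)
```

With the forget gate open and the input gate closed, the memory cell carries its old value almost unchanged, which is how the cell preserves long-term dependencies. With the opposite setting, the old memory is discarded and replaced by the new candidate value.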

A DBN is a deep neural network with multiple hidden layers. It is a combination 
of unsupervised feature learning and supervised parameter adjustment. It performs 
unsupervised greedy learning using stacked restricted Boltzmann machines to extract 
high-level and abstract features from the original data. It uses a BP neural network 
to reversely fine-tune the parameters to realize the supervised learning of data. 

Together, these works ushered in a new era of second-generation AI based on
deep learning. Owing to the universal approximation property of neural networks,
a sufficiently large network can approximate any continuous function on a bounded
domain to arbitrary accuracy. Therefore, using deep learning to fit the function
underlying the data has a theoretical guarantee.

In 2014, deep learning was shown to be vulnerable to spoofing and attacks.
Owing to the uncertainty of observation and measurement, the acquired data
are inevitably incomplete and noisy. In this case, the choice of the neural network
structure is extremely important. If the network is too simple, there is a risk of
underfitting; if it is too complex, overfitting occurs. Although the risk of overfitting
can be reduced to a certain extent through various regularization methods, the
generalization ability will inevitably decline seriously if the quality of the data
is poor.


1.3 The Third-Generation AI 


The third generation of AI needs to overcome the shortcomings of the first and second
generations. It must establish a sound AI theory and develop AI technology that
is safe, credible, reliable, and scalable. Only when these conditions are met
can a real technological breakthrough be achieved, one that will produce new
innovative applications of AI. The best current approach is an organic combination of
the first-generation knowledge-driven approaches and the second-generation
data-driven approaches. The combined use of knowledge, data, algorithms, and computing
power results in a more powerful AI.

Research on third-generation AI must move towards giving AI powerful
knowledge and reasoning capabilities. For this purpose, we need to draw on classic examples,
such as the Watson conversational system, which was introduced in 2011. The
following lessons on knowledge representation and inference methods from this system
are worth learning: first, the automatic generation of structured knowledge
representations from a large amount of unstructured text; second, a method for representing
knowledge uncertainty based on knowledge quality scoring; third, an approach
that combines multiple kinds of reasoning to achieve reasoning under uncertainty. The
development of third-generation AI requires strong knowledge and reasoning capabilities,
in addition to strong perception. There have been some tentative efforts to apply
the principle of sparse firing to the computation of artificial neural network (ANN)
layers [26]. Specifically, the network is trained with images that have simple
backgrounds, such as “human,” “car,” “elephant,” and “bird”, as training samples.
The neurons representing these “categories” appear in the output layer of the neural
network. The network responds to
the contours of the human face, car, elephant, and bird. In this way, the semantic
information of the “whole object” is extracted, thus making the neural network
perceptive. However, this approach can extract only part of the semantic information
and cannot extract semantic information at different levels; therefore, further research
is needed. Furthermore, the third generation of AI also needs to interact with the
environment. Reinforcement learning has made good progress in many areas, such
as video games [39, 50], board games [47, 48], robot navigation and control [11, 44],
and human-computer interaction. In some tasks, the performance of reinforcement
learning approaches even surpasses that of humans.

Attempts at third-generation AI have emerged in the academic community.
Zhu et al. [56] proposed a triple-space fusion model, that is, a model
that fuses the dual-space and single-space approaches. Such a model has the
opportunity to be interpretable and robust. When the model can convert sensory signals
such as vision and hearing into symbols, the machine has the opportunity to develop
comprehension capabilities, which will help solve the problems of interpretability
and robustness. If the symbols in the machine are generated by
the perception of the machine itself, then the symbols and symbolic reasoning can
carry intrinsic semantics, which can hopefully solve the problems of interpretability
and robustness of machine behavior at the root. Among the models discussed by
Zhu et al. [56], the dual-space model mimics
the working mechanism of the brain; however, it has some uncertainties.
For example, the machine can establish “intrinsic semantics” through reinforcement
learning in its environment, yet it is not certain whether these semantics are
consistent with the “intrinsic semantics” acquired by humans through perception.
Despite these difficulties, we still believe that machines that take steps in this
direction will edge closer to true AI. The single-space model is based on deep learning,
which fully uses the computational power of the computer, and in some respects it
already outperforms humans. However, it is uninterpretable and has poor robustness,
and it is still unknown how much progress can be made through algorithmic
improvements alone. Therefore, to achieve the goal of third-generation AI, the best
strategy is to advance simultaneously along both lines, namely, the convergence of
the three spaces. This will make the most of the working mechanism of the brain and
make full use of the computing power of the computer. The combination of the two
approaches is expected to lead to a more powerful AI.

The third generation of AI is a new form of AI driven by data and knowledge in
concert. It fits well with our need to explore the ocean using both ocean knowledge
and ocean data. This new form of AI technology will be a powerful tool for people to
understand the ocean and further develop it. The combination of AI technology and
ocean research will open a new chapter in oceanography.
At present, third-generation AI is still in its initial stage, and the AI used in
academia and industry is mainly second-generation. Among the second-generation
AI technologies, those based on deep neural networks are the most representative.
Therefore, the rest of this chapter introduces common neural network structures
and their applications.


2 The Architecture of Deep Neural Networks 


2.1 Deep Feedforward Neural Network 


A deep FNN is a typical deep learning model that can be viewed as a mathematical
function. It realizes a complex mapping from input to output by composing
nonlinear functions. The following subsections introduce the neuron model, the BP
algorithm, the single-layer FNN, and the multi-layer FNN.


2.1.1 Neuron 


A biological neuron usually has multiple dendrites and a single axon. Dendrites
receive signals transmitted from other neurons, and the axon has multiple
endings for transmitting signals to other connected neurons. The psychologist McCulloch
and the mathematician Pitts proposed an abstract neuronal model based on biological
neurons in 1943 [37].

A complete neuron can be viewed as a computational process of “input, numerical 
computation, and output.” The “numerical computation” consists of “a linear part 
and a non-linear part.” Figure 3 shows a typical biological neuron and neuron model. 

This neuron has three inputs, $a_1$, $a_2$, and $a_3$, and one output. By assigning the
corresponding weights $w_1$, $w_2$, and $w_3$ to the inputs, the output of the linear part can
be expressed as

$$z = a_1 w_1 + a_2 w_2 + a_3 w_3 + b \tag{1}$$

where $b$ is the bias term.
The nonlinear part is implemented by the activation function, and the output of
the neuron can be expressed as

$$y = f(z) \tag{2}$$

where $f(\cdot)$ is the activation function.
The activation function is a nonlinear mapping relationship and must adhere to 
the following three conditions: 


(1) It must be a continuous and differentiable nonlinear function (it may be
non-differentiable at a finite number of points).

(2) The activation function and its derivative should be simple; overly complex
functions are not conducive to network efficiency.

(3) The range of the derivative should be limited to a suitable interval; this
is beneficial for improving network efficiency.

Fig. 3 Comparison of a biological neuron and neuron model. The left side shows a biological
neuron. It consists of a nucleus, a cell body, an axon, dendrites and axon terminals. The right
side shows a neuron model. It can be viewed as a computational process of “input, numerical
computation, and output”

Fig. 4 Function diagrams of some common activation functions


Common activation functions are shown in Fig. 4. 
In a neural network, the process of the input going through the neuron to compute 
the output is called a forward propagation algorithm. 
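
The forward computation of a single neuron, a weighted sum followed by an activation function, can be written directly. The sketch below is a minimal illustration; the function names and the numeric inputs are assumptions, and the sigmoid and ReLU functions are used only as two familiar examples of activation functions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=sigmoid):
    """Forward pass of one neuron: linear part z, then nonlinear part f(z)."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias  # linear part
    return activation(z)                                     # nonlinear part


a = [1.0, 2.0, 3.0]
print(neuron(a, [0.5, -0.5, 0.0], 0.5))          # z = 0, so sigmoid gives 0.5
print(neuron(a, [1.0, 1.0, 1.0], -10.0, relu))   # z = -4, so ReLU gives 0.0
```

Swapping the activation argument changes only the nonlinear part; the linear part is the same for every choice of activation function.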
2.1.2 Backpropagation Algorithm


It is not enough to know only the neuron in neural networks because the parameters w 
and b need to be learned. Therefore, it is necessary to introduce a BP algorithm to 
update the parameters. 

In supervised learning, we have a dataset containing the inputs and the corre- 
sponding outputs. We call the correct output a label. The purpose of training a neural 
network is to learn the correct mappings of inputs to outputs. 

First, we assign random values to the parameters w and b, and generate the
predicted values using the forward propagation algorithm. We define a cost function
J(w, b) to measure how far the predicted output is from the label. Cross-entropy
is used as the cost function in classification problems, where the output is a discrete
variable, and the mean-squared error is used in regression problems, where the
output is a continuous variable.

The optimization problems for the cross-entropy cost function and the
mean-squared error cost function are expressed as follows:

$$\min J(w, b) = -\sum_{i=1}^{K} y_i \log p_i \tag{3}$$

$$\min J(w, b) = \frac{1}{K} \sum_{i=1}^{K} (y_i - p_i)^2 \tag{4}$$

where $K$ is the number of samples, $y_i$ is the label value of the $i$-th sample, and $p_i$ is
the predicted value of the $i$-th sample.

Thus, the objective is transformed into solving for the parameter values that
minimize the cost function J(w, b). The gradient descent algorithm is generally used to
solve such optimization problems. The gradient at a point gives the direction along
which the directional derivative of the function is largest, that is, the direction of
steepest ascent; moving against it decreases the function fastest. The specific
process is as follows. First, a set of parameter values is selected at random. Then, the
predicted output and the cost function J(w, b) are computed, together with the gradient
of J(w, b) with respect to w and b. Finally, the gradient is used to update w and
b, and these steps are repeated until the cost function reaches a minimum. The update
rules for the parameters w and b are:


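$$w := w - \alpha \frac{\partial J(w, b)}{\partial w} \tag{5}$$

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b} \tag{6}$$

where $\alpha$ is the learning rate and denotes the step length of each gradient descent step.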
The process of updating the parameters w and b against the gradient direction to
minimize the cost function is called the gradient descent algorithm.
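
As a minimal sketch of this procedure, the code below fits a straight line p = w x + b to four points by gradient descent on the mean-squared error cost. The data, the learning rate of 0.05, the number of steps and the zero initialization (used instead of a random start so that the run is reproducible) are all assumptions made for this example.

```python
def gradients(w, b, xs, ys):
    """Gradient of J(w, b) = (1/K) * sum((y_i - (w*x_i + b))^2) w.r.t. w and b."""
    k = len(xs)
    dw = sum(-2.0 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / k
    db = sum(-2.0 * (y - (w * x + b)) for x, y in zip(xs, ys)) / k
    return dw, db


def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Repeatedly step against the gradient, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradients(w, b, xs, ys)
        w -= alpha * dw   # parameter update for w
        b -= alpha * db   # parameter update for b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1
w_fit, b_fit = gradient_descent(xs, ys)
print(w_fit, b_fit)
```

The learning rate controls the trade-off between speed and stability: a value that is too large makes the updates overshoot and diverge, while a value that is too small makes convergence slow.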

The BP algorithm is a specific implementation of the gradient descent method on
deep neural networks. Because the network has many layers, the gradient of the cost
function with respect to the parameters of each layer must be computed layer by layer,
from the output layer back to the input layer, and the parameters are updated
continuously.


2.1.3 Single-layer Feedforward Neural Network 


The FNN is the most common type of neural network in which neurons are arranged 
in layers. Each layer has several neurons, and each neuron is connected only to the 
neurons in the previous layer. Neurons in each layer only receive the input signal 
from the previous layer and output the processed signal to the neurons in the next 
layer. The first layer is called the input layer and the last layer is called the output 
layer. The remaining intermediate layers are called hidden layers; these can be one 
or more layers. The signal enters the network from the input layer and is transmitted 
layer by layer to the output layer. There is no feedback in the entire network, and 
the output does not affect the network model or network input. This section depicts 
a single-layer FNN containing one hidden layer as an example; this will allow us to 
introduce the basic principles of FNNs. 

Figure 5 shows the general structure of a single-layer FNN, where $x_n^{(m)}$ denotes
the n-th feature of the m-th sample, and the number of neurons in the input layer is
the same as the number of features of the sample. $y_k^{(m)}$ denotes the k-th output of
the m-th sample and k is the number of neurons in the output layer, where k ≥ 2 for the
classification problem and k = 1 for the regression problem.

The learning process of single-layer FNNs consists of forward propagation and 
BP; this is illustrated below with a simple network in Fig. 6. The input is propagated 
forward along the direction of the network structure to the output layer, and then the 
weights and bias are updated by the BP algorithm. 

In the forward propagation, information is propagated from the input layer to the 
output layer after being processed by the hidden layer, and the state of the neurons in 
each layer only affects the state of neurons in the next layer. Assuming that the input 
layer is the 0-th layer, the output of the neurons in the hidden layer and the output 
layer can be expressed as: 


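$$h_1^{(1)} = g^{(1)}\left(x_1^{(0)} w_{11}^{(1)} + x_2^{(0)} w_{21}^{(1)} + x_3^{(0)} w_{31}^{(1)} + b_1^{(1)}\right) \tag{7}$$

$$h_2^{(1)} = g^{(1)}\left(x_1^{(0)} w_{12}^{(1)} + x_2^{(0)} w_{22}^{(1)} + x_3^{(0)} w_{32}^{(1)} + b_2^{(1)}\right) \tag{8}$$

$$y = g^{(2)}\left(h_1^{(1)} w_{11}^{(2)} + h_2^{(1)} w_{21}^{(2)} + b_1^{(2)}\right) \tag{9}$$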


where $w_{ij}^{(l)}$ denotes the connection weight of the i-th neuron in layer l − 1 to
the j-th neuron in layer l, $b_j^{(l)}$ denotes the bias used to compute the linear
weighted summation of the j-th neuron in layer l, $g^{(l)}$ denotes the activation
function of the neurons in layer l, $h_j^{(1)}$ denotes the output of the j-th neuron in
the hidden layer, and y denotes the output of the neuron in the output layer.

Fig. 5 The general structure of single-layer feedforward neural networks. The blue, gray, and yellow
neurons form the input layer, hidden layer, and output layer of the neural network, respectively

Fig. 6 A simple single-layer feedforward neural network
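
The forward propagation in Eqs. (7)-(9) can be sketched for a network with three inputs,
two hidden neurons, and one output. The sigmoid activation, the function names, and all
weight values below are assumptions chosen so that the result is easy to check by hand;
they are not taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2, g=sigmoid):
    """Forward pass of a 3-2-1 network.

    w1[i][j] is the weight from input i to hidden neuron j; w2[j] is the
    weight from hidden neuron j to the single output neuron.
    """
    hidden = [g(sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]                              # Eqs. (7), (8)
    y = g(sum(hidden[j] * w2[j] for j in range(len(hidden))) + b2)  # Eq. (9)
    return hidden, y

# Hypothetical inputs and parameters.
x = [1.0, 2.0, 3.0]
w1 = [[0.1, -0.1], [0.2, -0.2], [0.3, -0.3]]
b1 = [0.0, 0.0]
w2 = [1.0, -1.0]
b2 = 0.0
hidden, y = forward(x, w1, b1, w2, b2)
```

Because the two hidden neurons receive weighted sums of equal magnitude and opposite sign,
their sigmoid outputs add up to one, which gives a quick sanity check on the computation.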

In BP, the gap between the network output and the real value, measured by the loss
function, is calculated first. Subsequently, the gradient of the loss function with
respect to the parameters, such as the weights and bias terms, is calculated, and the
parameters are updated by gradient descent to minimize the loss function. The learning
process of the network is iterated to optimize the network model until the loss
function is sufficiently small or the maximum number of iterations is reached.

When the structure and weights of a neural network are determined, the network 
forms a nonlinear mapping from the input to the output. For a single-layer FNN, 
if the number of neurons in the hidden layer is large enough, it can approximate 
any continuous function on a bounded region with arbitrary accuracy and can solve 
complex nonlinear classification tasks well. 


2.1.4 Multi-layer Feedforward Neural Network 


A network that contains multiple hidden layers between the input and output layers 
is called a multi-layer FNN, which can also be called a deep FNN. Depth refers to 
the number of layers in a neural network model. The greater the number of layers 
designed, the greater the depth and complexity of the model. A deep FNN has better 
classification and memory ability, as well as a stronger function fitting ability. It 
can handle more complex data structures or data whose structures are difficult to 
predefine. Figure 7 shows a neural network containing four hidden layers. 

Fig. 7 A multi-layer feedforward neural network containing four hidden layers

The output of each layer of the network can be expressed as follows:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \tag{10}$$

$$a^{(l)} = f_l\left(z^{(l)}\right) \tag{11}$$


where $f_l(\cdot)$ denotes the activation function of neurons in layer l, $W^{(l)}$ and
$b^{(l)}$ denote the weight matrix and bias from layer l − 1 to layer l, respectively,
and $z^{(l)}$ and $a^{(l)}$ denote the net input and output of neurons in layer l.
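
Eqs. (10) and (11) describe a recursion that is applied once per layer. A minimal sketch
follows; the ReLU and identity activations, the layer sizes, and the weight values are
hypothetical choices made only to keep the arithmetic easy to follow.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def dense_forward(a, layers):
    """Apply Eqs. (10) and (11) layer by layer.

    Each layer is a tuple (W, b, f): W[j][i] is the weight from neuron i in
    layer l-1 to neuron j in layer l, b is the bias vector, f the activation.
    """
    for W, b, f in layers:
        z = [sum(w_ji * a_i for w_ji, a_i in zip(row, a)) + b_j
             for row, b_j in zip(W, b)]   # z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)                          # a^(l) = f_l(z^(l))
    return a

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),  # hidden layer
    ([[1.0, 1.0]], [0.5], identity),                 # output layer
]
output = dense_forward([2.0, 1.0], layers)
```

Adding more tuples to `layers` deepens the network without changing the loop, which is
the sense in which depth refers to the number of layers.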

Deep FNNs also use the BP algorithm to optimize the network model. The representation
ability of the network is greatly enhanced because of the increase in the number
of layers and parameters. However, the network is prone to overfitting; that is, it
achieves small errors on the training data but large errors on the test data.
Therefore, appropriate regularization techniques need to be applied to improve the
generalization ability of the network model.

Deep FNNs are the basis of many AI applications. They can perform complex
data processing and pattern recognition tasks. However, they do not process image
data well: the networks require a large number of neurons and parameters for image
data, which leads to high computational requirements, low computational efficiency,
and a tendency to overfit.


2.2 Deep Convolutional Neural Network 


The biggest advantage of CNNs over FNNs is the reduction of parameters. This
allows researchers to build and design larger models to solve complex problems. For
example, a picture in jpg format with a resolution of 480 x 480 is represented in
the computer as a 480 x 480 x 3 tensor, where the three dimensions correspond to
the height, width, and number of channels. If this image is fed into an FNN, each
neuron in the first hidden layer needs to be connected to all 691,200 (480 x 480 x 3)
tensor elements, so a single neuron already requires 691,200 weights. Since a hidden
layer contains many such neurons, the number of parameters that an FNN needs to extract
the tensor features is enormous, and the fully connected mechanism of FNNs is
inefficient in handling large input data. Compared with the fully connected layer of
the FNN, the convolutional layer requires far fewer parameters to extract the tensor
features. CNNs are widely used in many fields because of this efficiency.
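
The parameter counts behind this comparison can be checked with a few lines of
arithmetic. The layer widths used below (64 fully connected neurons and 64 kernels of
size 3 x 3) are hypothetical values chosen for illustration; only the 480 x 480 x 3
input comes from the text.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus one bias per neuron in a fully connected layer."""
    return n_neurons * (n_inputs + 1)

def conv_params(kernel_h, kernel_w, in_channels, n_kernels):
    """Weights plus one bias per kernel in a convolutional layer."""
    return n_kernels * (kernel_h * kernel_w * in_channels + 1)

n_inputs = 480 * 480 * 3                  # 691,200 tensor elements
one_neuron = dense_params(n_inputs, 1)    # a single fully connected neuron
dense_layer = dense_params(n_inputs, 64)  # hypothetical layer of 64 neurons
conv_layer = conv_params(3, 3, 3, 64)     # hypothetical 64 kernels of size 3 x 3
```

Under these assumptions the fully connected layer needs about 44 million parameters,
whereas the convolutional layer needs fewer than two thousand, because the kernel size
does not grow with the image resolution.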

This section contains four subsections. Section 2.2.1 introduces the mechanism 
of the CNNs and their basic structures. Section 2.2.2 introduces the mechanism of 
full CNNs. Section 2.2.3 introduces typical CNN structures. Section 2.2.4 introduces 
the problems and shortcomings of deep convolutional networks. 


2.2.1 Mechanism of Convolutional Neural Network


CNNs obtain their name from convolution, a linear mathematical operation between
matrices. A CNN is a representation learning method with a multilayer structure, which
mainly consists of convolutional layers, pooling layers, and fully connected layers.
The convolutional and fully connected layers contain parameters, whereas the pooling
layer does not. As shown in Fig. 8, the image is input to the CNN and passes through
the convolutional and pooling layers alternately, after which the image features are
flattened into a one-dimensional feature vector. The CNN finally outputs the result
through the fully connected layer. The convolutional and pooling layers are equivalent
to feature extraction structures, which are used to extract features from the input
tensor. The fully connected layer is equivalent to a classifier, which is used to
classify the flattened feature vector.

As the name implies, the convolutional layer is the most important operation 
in a CNN, and the parameters of the convolutional layer are mainly located in the 
convolutional kernel. The convolution kernel is usually represented by a small-size 
tensor that only acts on a local region within the space of the input tensor. 

Taking the most widely used two-dimensional convolution as an example, the
specific process of the convolution operation is shown in Fig. 9a. The convolution
operation selects all local regions in the spatial dimension of the input features that
are consistent with the size of the convolution kernel. Each convolution kernel extends
over all channels of the input tensor, and the same kernel parameters are applied to
every local region. The convolution operation is therefore a three-dimensional inner
product between the kernel parameters and the features of one spatially localized
region across all channels. The result is a scalar assigned to the center position of
that region. The output corresponding to a single convolution kernel is a
two-dimensional feature map, and the number of convolution kernels in the convolution
layer is the same as the number of channels in the output tensor.

There is some degree of information overlap between the output results of the
convolutional layer and its neighboring outputs in the spatial dimension. This may
lead to information redundancy. As shown in Fig. 9b, the input feature map size of
the convolution operation is 7 x 7. If the selected convolution kernel size is 3 x 3
and the step size is one, there will be an overlap of six elements between two adjacent
convolution operations. If the step size parameter is set to two, the overlap is
reduced to three elements, and the size of the output feature map of the convolution
operation is 3 x 3. A reasonable increase in the step size reduces the overlap between
adjacent convolution operations and reduces the size of the output feature map.

Fig. 8 Overview of the convolutional neural network architecture. The architecture mainly consists
of several convolution layers, pooling layers and full connection layers

Fig. 9 Illustration of Convolution Implementation Mechanism. a Convolution kernel parameters
are shared in the input tensor channel dimension. b The sliding mechanism of the convolution
kernel in the dimension of the input tensor space. c Padding operation in the process of convolution
implementation

Convolution operations lose information about edge features, which is an inherent
drawback of convolution operations. A simple and effective remedy is padding. As
shown in Fig. 9c, if the size of the convolution kernel is 5 x 5, the step size is one,
and the padding parameter is set to two, the size of the output feature map of the
convolution operation can be kept consistent with the input to avoid the loss of the
edge information.
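
The effect of the step size and padding on the output size follows the standard relation
out = floor((n + 2p − k) / s) + 1 for input size n, kernel size k, step size s, and
padding p. The sketch below checks the two cases from the text with a naive
single-channel convolution. The function names and the all-ones input and kernels are
assumptions for illustration, and the code assumes square inputs and kernels.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution on a square input (no kernel flip)."""
    k = len(kernel)
    n = len(image)
    # Zero-pad the input on all four sides.
    size = n + 2 * padding
    padded = [[0.0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]
    out = conv_output_size(n, k, stride, padding)
    return [[sum(padded[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out)]
            for r in range(out)]

image = [[1.0] * 7 for _ in range(7)]       # hypothetical 7 x 7 input of ones
kernel3 = [[1.0] * 3 for _ in range(3)]
strided = conv2d(image, kernel3, stride=2)  # 3 x 3 output, the case of Fig. 9b
kernel5 = [[1.0] * 5 for _ in range(5)]
same = conv2d(image, kernel5, padding=2)    # 7 x 7 output, the case of Fig. 9c
```

In the padded case the corner outputs are smaller than the central ones because part of
each corner window covers the zero padding rather than image content.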

Although the convolutional layer reduces the number of connections between the output
neurons and the input features, the number of neurons is not significantly reduced
during the convolution operation, and the feature dimensionality remains high. To make
the model easier to optimize, pooling layers are introduced to the CNN. The pooling
layer is often located between two convolutional layers and is designed to perform
feature selection and reduce the dimensionality of the feature map. There are two
mainstream pooling approaches: maximum pooling and average pooling. As shown in
Fig. 10, a 2 x 2 filter is applied to input features with a spatial dimension of 4 x 4
and slides with a step size of two. The maximum and the average of the feature values
covered by the filter at each position are determined, which yields the max-pooled
feature map and the average-pooled feature map as outputs.

Fig. 10 Comparison of two different pooling operations
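
The two pooling operations can be sketched as follows for a 4 x 4 input, a 2 x 2 filter,
and a step size of two. The feature values and the function name are hypothetical, and
the code assumes a square feature map.

```python
def pool2d(features, size=2, stride=2, mode="max"):
    """Max or average pooling over a square 2D feature map."""
    agg = max if mode == "max" else (lambda window: sum(window) / len(window))
    out = (len(features) - size) // stride + 1
    return [[agg([features[r * stride + i][c * stride + j]
                  for i in range(size) for j in range(size)])
             for c in range(out)]
            for r in range(out)]

# Hypothetical 4 x 4 input features.
features = [[1.0, 3.0, 2.0, 4.0],
            [5.0, 7.0, 6.0, 8.0],
            [4.0, 2.0, 3.0, 1.0],
            [8.0, 6.0, 7.0, 5.0]]
max_pooled = pool2d(features, mode="max")
avg_pooled = pool2d(features, mode="avg")
```

Both outputs have a spatial size of 2 x 2, so pooling reduces the number of feature
values by a factor of four without introducing any trainable parameters.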


The fully connected layer in a CNN has the same structure as the feedforward
network introduced in the previous section. The number of fully connected layers
can be chosen according to the specific needs of the task. Compared with the
convolutional layer, the fully connected layer contains a large number of parameters
and dense connections. Therefore, more fully connected layers make the model more
difficult to optimize; in mainstream CNNs, the number of fully connected layers
typically does not exceed three.


2.2.2 Mechanism of a Fully Convolutional Neural Network 


In contrast to the CNN, the fully convolutional network (fully CNN) does not contain a
fully connected layer; its structure is shown in Fig. 11. Each element of the output of
the convolutional and pooling operations represents local information of the input
features, whereas the fully connected layer depicts global information of the input
features; thus, only the convolutional and pooling operations preserve the spatial
information of the input tensor. A fully CNN, composed entirely of convolutional and
pooling operations, can therefore output a feature map that retains the spatial
location information. Compared to traditional CNNs, fully CNNs are more suitable for
tasks such as image segmentation, where both the input and output are images.


2.2.3 Common Convolutional Neural Networks 


During the development of CNNs, several representative networks have emerged, 
such as VGGNet, ResNet, and DenseNet. In this section, we introduce these networks. 

VGGNet has two structures of 16 and 19 layers, as shown in Fig. 11. All the
convolution kernels in the VGGNet network are of size 3 x 3. VGGNet cascades
three sets of convolution operations with a kernel size of 3 x 3 and a step size of
one. These three sets of convolution operations have the same receptive field as one
convolution operation with a kernel size of 7 x 7. This has two main benefits. First, a
deeper network structure will learn more complex nonlinear relationships, which will
lead to better results for the model. Second, this reduces the number of parameters
required to perform the convolutional operations.

Fig. 11 Overview of the fully convolutional neural network architecture. The input and output of
this architecture are pictures
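
Both benefits can be verified with a short calculation. The sketch below assumes a step
size of one, ignores biases, and uses the same number of input and output channels
(a hypothetical value of 64) for every convolution.

```python
def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked convolutions with stride one."""
    return 1 + n_layers * (kernel - 1)

def conv_weights(kernel, channels, n_layers=1):
    """Weight count (no biases) with `channels` input and output channels."""
    return n_layers * kernel * kernel * channels * channels

channels = 64                                    # hypothetical channel count
rf = stacked_receptive_field(3, 3)               # three stacked 3 x 3 kernels
stacked = conv_weights(3, channels, n_layers=3)  # 27 * C^2 weights
single = conv_weights(7, channels)               # 49 * C^2 weights
```

Three 3 x 3 convolutions cover a 7 x 7 region with 27 C^2 weights instead of 49 C^2,
a reduction of roughly 45% under these assumptions.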

In 2015, Kaiming He et al. proposed a 152-layer ResNet, which won the 2015 ILSVRC
competition with a top-5 error of 3.57%. The proposed ResNet was revolutionary
because the network introduced a residual mechanism. Compared to the traditional
convolutional module, the convolutional module in the residual network learns only
a small variation (the residual) of the input features. The output of the residual
convolution module is the sum of the input features and this learned variation. When
BP is performed, the residual structure passes part of the gradient information
directly through the shortcut and thus alleviates the vanishing gradient problem.
Therefore, ResNet models can be made deeper and achieve better performance.
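
The effect of the residual structure on BP can be made explicit with a short derivation
for a scalar feature. The symbols used here (x for the input feature of a residual
module, F(x) for the variation learned by its convolutional layers, y for the module
output, and L for the loss) are notation assumed for illustration and do not appear in
the original text.

$$y = x + F(x), \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(1 + \frac{\partial F(x)}{\partial x}\right)$$

Because of the constant term 1, the gradient that reaches x does not vanish even when
the derivative of F is close to zero, which is why stacking many such modules remains
trainable.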

ResNet uses skip connections to pass gradients directly from later layers to
earlier layers. However, the skip-connected features and the convolutionally
transformed features need to be summed before they can be output, which may impede
the information propagation in the network. In response, Huang et al. [27] proposed
the network structure of DenseNet, which builds on ResNet. Unlike ResNet, in which a
skip connection links a module only to the preceding one, DenseNet achieves feature
reuse by directly connecting each layer to all of its preceding layers. Compared to
ResNet, DenseNet not only reduces the error rate but also reduces the number of
parameters in the network.


2.2.4 The Shortcomings of Convolutional Neural Network


The convolutional operations of CNNs can be divided into one-dimensional,
two-dimensional, and three-dimensional convolutions according to the dimensions of
the convolutional kernel. One-dimensional convolution is mainly applied to tasks
related to one-dimensional sequence signals, such as EEG signal analysis, speech
signal processing, and radio signal classification. Two-dimensional convolution is
mainly used for images, both in the image processing field, such as image
super-resolution, image denoising, and image restoration, and in the computer vision
field, such as image recognition, target detection, and semantic segmentation.
Three-dimensional convolution is commonly used in medical CT image segmentation and
video motion recognition.

Although CNNs have powerful feature extraction capabilities, they are based on
the assumption of mutual independence between consecutive input samples. Therefore,
CNNs are difficult to apply to tasks where there is an inherent logical
relationship between successive inputs.


2.3 Deep Recurrent Neural Network 


2.3.1 Mechanism of Recurrent Neural Network


Jordan [29] proposed the Jordan Network in 1986 and designed a memory mechanism
that feeds the output of the entire network back to the input layer at the next
moment. One of the foundational works on RNNs is the simple recurrent network (SRN),
which was proposed by Elman in 1990 [12]. The SRN is a modification of the Jordan
Network in which the output of the hidden layer, rather than that of the whole
network, is fed back to the input layer at the next moment. The Jordan Network
uses the entire network as a loop, whereas the SRN only uses the hidden layer as
a loop. Therefore, the SRN is more flexible to use; it also avoids the problem of
conversion between network output dimensions and input dimensions.

The structures of the SRN and the widely used RNN are similar. The output value
of the RNN at a given moment is jointly determined by the inputs at multiple past
moments. In practice, however, the desired output is often also affected by future
inputs.

Driven by this observation, Schuster and Paliwal [45] improved the traditional RNN
by designing a bidirectional recurrent neural network (bidirectional RNN, BRNN). The
BRNN is a superposition of two RNNs running in opposite directions. Each hidden layer
must record two values, one for the forward and one for the backward direction. The
final output value depends on the RNNs calculated in both directions (Fig. 12).

Hochreiter and Schmidhuber [25] proposed the LSTM. In RNNs, owing to the decay of
information in the recurrent structure, it takes a long time to learn to store
information over a given time interval through BP. To solve this problem, the LSTM adds
an input gate and an output gate to the RNN, which control the state and output of the
recurrent unit at each time step. Gers et al. [16] pointed out that the original LSTM
has difficulty processing long input sequences that have no clearly marked ends. They
therefore added a forget gate mechanism, which allows the LSTM to learn to reset itself
at appropriate times and release the internally stored information. One year later,
Gers and Schmidhuber [15] improved the LSTM again by adding peephole connections, so
that the value memorized at the previous moment is also used as an input of the gate
structure; thus the structure of the LSTM gradually improved to its present-day state.

Fig. 12 The structure of the recurrent neural network. The blue neuron represents the time series
of sequential input. The purple neuron represents the output of the recurrent network. The pink
neurons represent the information transmitted in the middle

Graves and Schmidhuber [18] proposed a bidirectional LSTM (BLSTM), which
combined the BRNN and LSTM. Graves et al. [19] applied the LSTM to handwritten
text recognition tasks, which overcame the difficulty faced by traditional models in
segmenting scribbled and overlapping text. The offline text recognition rate was
74.1%, which was the best at that time. Graves et al. [20] combined deep networks
and RNNs and proposed a deep recurrent network based on LSTM, which was
applied to speech recognition tasks. The error on the TIMIT dataset was only 17.7%,
which was the best at that time. In 2014, Cho et al. [7] simplified the gate structure
by replacing the input gate, forget gate, and output gate of the LSTM with an update
gate and a reset gate, proposing the gated recurrent unit for the first time. The same
work proposed an encoder-decoder model for statistical machine translation.

Owing to the further development of the LSTM, the LSTM has been widely used 
in various applications of natural language processing and many variants have been 
developed. Next, we introduce LSTM and ConvLSTM in detail. 


2.3.2 Mechanism of LSTM 


Recurrent connections can improve the performance of neural networks by leveraging
their ability to understand sequential dependencies. However, the memory produced
by recurrent connections can be severely limited by the algorithms employed
for training the RNNs. Conventional RNNs are affected by exploding or vanishing
gradients in the training phase, which renders the network unable to learn the
long-term dependencies in the data. The LSTM was specifically designed to solve this
problem.

LSTM is one of the most commonly used and effective methods for mitigating
vanishing and exploding gradients. It changes the structure of the hidden unit so that
the input and output of the neural unit are controlled by gates. These gates control
the information flow of hidden neurons and retain the features extracted from the
previous time steps.

Fig. 13 The structure of LSTM

As illustrated in Fig. 13, an LSTM-based recurrent layer maintains a series of
memory cells $c_t$ at time step t. The activation of the LSTM units can be calculated
by:

$$h_t = o_t \odot \tanh(c_t) \tag{12}$$

where tanh(·) is the hyperbolic tangent function, ⊙ is the element-wise multiplication,
and $o_t$ is the output gate that controls the extent to which the memory content is
exposed. The output gate is updated as:

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{13}$$

where $x_t$ is the input at time step t, σ(·) is the sigmoid function, $W_o$ and $U_o$
are the input-output weight matrix and memory-output matrix, respectively, and $b_o$ is
the bias. The memory cell $c_t$ is updated by partially discarding the present memory
contents and adding new contents $\tilde{c}_t$ of the memory cells:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{14}$$

The new memory contents are

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{15}$$

Here, $W_c$ and $U_c$ are the input-memory weight matrix and hidden-memory coefficient
matrix, respectively; $b_c$ is the bias; $i_t$ is the input gate, which modulates the
extent to which the new memory information is added to the memory cell; $f_t$ is the
forget gate, which controls the degree to which the contents of the existing memory
cells are forgotten. The gates are computed as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i) \tag{16}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1} + b_f) \tag{17}$$

where $W_i$, $U_i$, and $V_i$ are the input-memory weight matrix of the input gate, the
hidden-memory coefficient matrix, and the output weight matrix of the previous cell
state, respectively; $W_f$, $U_f$, and $V_f$ are the input-memory weight matrix of the
forget gate, the hidden-memory coefficient matrix, and the output weight matrix of the
previous cell state, respectively; $b_i$ and $b_f$ are the biases.
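
One time step of Eqs. (12)-(17) can be sketched in pure Python. The helper functions,
the parameter dictionary `params`, the use of `x_t` for the input at time step t, and
the layer sizes are assumptions made for illustration. All parameters are set to zero
here, which makes every gate equal to 0.5 and keeps the result easy to check by hand; a
trained network would of course use learned values.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def mul(u, v):
    return [a * b for a, b in zip(u, v)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'W_i' to matrices or vectors."""
    i_t = sigmoid(add(matvec(p["W_i"], x_t), matvec(p["U_i"], h_prev),
                      matvec(p["V_i"], c_prev), p["b_i"]))      # Eq. (16)
    f_t = sigmoid(add(matvec(p["W_f"], x_t), matvec(p["U_f"], h_prev),
                      matvec(p["V_f"], c_prev), p["b_f"]))      # Eq. (17)
    c_new = tanh(add(matvec(p["W_c"], x_t), matvec(p["U_c"], h_prev),
                     p["b_c"]))                                 # Eq. (15)
    c_t = add(mul(f_t, c_prev), mul(i_t, c_new))                # Eq. (14)
    o_t = sigmoid(add(matvec(p["W_o"], x_t), matvec(p["U_o"], h_prev),
                      p["b_o"]))                                # Eq. (13)
    h_t = mul(o_t, tanh(c_t))                                   # Eq. (12)
    return h_t, c_t

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

n_in, n_hid = 3, 2  # hypothetical input and hidden sizes
params = {}
for gate in ("i", "f", "c", "o"):
    params["W_" + gate] = zeros(n_hid, n_in)
    params["U_" + gate] = zeros(n_hid, n_hid)
    params["b_" + gate] = [0.0] * n_hid
params["V_i"] = zeros(n_hid, n_hid)
params["V_f"] = zeros(n_hid, n_hid)

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

With all gates at 0.5 and no new memory content, the cell state is simply halved, which
shows how the forget gate scales the existing memory.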


2.3.3 Mechanism of convLSTM 


Unlike conventional artificial neural networks, the CNN does not require additional
complex preprocessing of the spatial distribution of the data; it uses a fine-grained
feature extraction method to process spatial data automatically. When dealing with
temporal features, the LSTM can effectively avoid the loss of valid information caused
by long intervals in the data. However, there are certain limitations if a CNN and an
LSTM are used in parallel to extract spatial and temporal features. For example, in
a parallel structure composed of a CNN and an LSTM, the inputs and outputs of the two
are relatively independent, and the relationship between the different features is
ignored.

Conv-LSTM was born out of a precipitation prediction problem [54]. The problem
is as follows: given maps of the precipitation distribution for the past few hours,
predict the precipitation distribution for the next few hours. This was accomplished
by replacing the feedforward calculations in the input-to-state and state-to-state
parts of the LSTM with convolution calculations. A cell diagram is shown in Fig. 14.

Fig. 14 The structure of Conv-LSTM

The principle of Conv-LSTM can be expressed by the following formulas:


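$$f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f) \tag{18}$$

$$i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i) \tag{19}$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \tag{20}$$

$$o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_{t-1} + b_o) \tag{21}$$

$$h_t = o_t \circ \tanh(c_t) \tag{22}$$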


where * denotes the convolution, ∘ denotes the element-wise (Hadamard) product, and
x, c, h, i, f, o are tensors. We can regard Conv-LSTM as a model that works on the
feature vectors of a two-dimensional grid: it predicts the spatiotemporal features of
a central grid cell based on the spatiotemporal features of the points around it.
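
A single Conv-LSTM step following Eqs. (18)-(22) can be sketched for one channel on a
small grid. The zero-padded convolution that keeps the grid size, the scalar biases,
the helper names, and all parameter values are assumptions made for illustration. Only
the kernel `W_xc` is nonzero, and it is set to an identity kernel so that the result
can be checked by hand.

```python
import math

def conv_same(grid, kernel):
    """Single-channel 2D convolution with zero padding that keeps the grid size."""
    n_r, n_c, k = len(grid), len(grid[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * n_c for _ in range(n_r)]
    for r in range(n_r):
        for c in range(n_c):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < n_r and 0 <= cc < n_c:
                        total += grid[rr][cc] * kernel[i][j]
            out[r][c] = total
    return out

def ew(fn, *grids):
    """Apply fn element-wise across grids of equal shape."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*grids)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gate(x_t, h_prev, c_prev, p, g):
    """sigma(W_xg * x_t + W_hg * h_{t-1} + W_cg o c_{t-1} + b_g)."""
    return ew(lambda a, b, w, c: sigmoid(a + b + w * c + p["b_" + g]),
              conv_same(x_t, p["W_x" + g]), conv_same(h_prev, p["W_h" + g]),
              p["W_c" + g], c_prev)

def convlstm_step(x_t, h_prev, c_prev, p):
    f_t = gate(x_t, h_prev, c_prev, p, "f")                             # Eq. (18)
    i_t = gate(x_t, h_prev, c_prev, p, "i")                             # Eq. (19)
    cand = ew(lambda a, b: math.tanh(a + b + p["b_c"]),
              conv_same(x_t, p["W_xc"]), conv_same(h_prev, p["W_hc"]))
    c_t = ew(lambda f, c, i, g: f * c + i * g, f_t, c_prev, i_t, cand)  # Eq. (20)
    o_t = gate(x_t, h_prev, c_prev, p, "o")                             # Eq. (21)
    h_t = ew(lambda o, c: o * math.tanh(c), o_t, c_t)                   # Eq. (22)
    return h_t, c_t

def zeros_grid(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

p = {"b_f": 0.0, "b_i": 0.0, "b_c": 0.0, "b_o": 0.0}
for g in ("f", "i", "c", "o"):
    p["W_x" + g] = zeros_grid(3, 3)   # 3 x 3 convolution kernels
    p["W_h" + g] = zeros_grid(3, 3)
for g in ("f", "i", "o"):
    p["W_c" + g] = zeros_grid(3, 3)   # element-wise weights, same shape as the state
p["W_xc"][1][1] = 1.0                 # identity kernel for the new memory content

x_t = zeros_grid(3, 3)
x_t[1][1] = 1.0                       # hypothetical input with a single active cell
h_t, c_t = convlstm_step(x_t, zeros_grid(3, 3), zeros_grid(3, 3), p)
```

Because the states `h_t` and `c_t` keep the shape of the input grid, the spatial layout
is preserved from one time step to the next, which is the property that distinguishes
Conv-LSTM from the fully connected LSTM.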


2.4 Deep Generative Adversarial Network 


CNNs and RNNs have been widely used in various fields, and have achieved good
results. However, these methods rely on a large amount of labeled data. In actual
research, the training sample data are often insufficient, which leads to a decline in
the recognition accuracy of the model. A generative adversarial network (GAN) can
generate realistic sample data. If these generated sample data are used to train the
model, the shortage of training sample data can be alleviated. At present, GANs have
been widely used in the fields of image and vision.


2.4.1 Architecture of Generative Adversarial Network 


GAN [17] originated from the two-person zero-sum game. The two-person zero-sum 
game is a concept in game theory, which says that the sum of the interests of both 
parties in the game is always zero or remains unchanged. If one party gains, the other 
party must have a corresponding loss. GAN is composed of two parts: a generator 
and a discriminator. These two parts can be regarded as the two parties of the game. 
The optimization process of a GAN is equivalent to the two-person zero-sum game 
process. 

The purpose of the GAN is to learn the distribution of real data so that it can
generate realistic data. The implementation of a GAN is shown in Fig. 15.

The generator G is used to capture the distribution of real data, and random noise
is used as an input to generate the sample data. To capture the real data distribution,
first, a random noise z that obeys the prior distribution $P_z(z)$ is given. Then the
mapping space $G(z; \theta_g)$ is constructed using $P_z(z)$. The mapping space
$G(z; \theta_g)$ is a generative model with parameters $\theta_g$. The random noise z is
used as the input of the generator, and sample data are generated through the mapping
space $G(z; \theta_g)$.

Fig. 15 The basic structure of the generative adversarial network

The discriminator D is used to determine whether the input sample is a real sample
or a generated sample, which is equivalent to a binary classifier. A mapping function
$D(x; \theta_d)$ is defined, through which the input sample is mapped to a scalar
between zero and one. This scalar represents the probability that the input sample is
a real sample; the mapping function $D(x; \theta_d)$ is a discriminator with parameters
$\theta_d$. It should be noted that the input of the discriminator consists of two
parts. One part is the sample generated by the generator, and the other part is the
real sample x that obeys the real data distribution.


2.4.2 Training of Generative Adversarial Network 


In the GAN training process, generator G and discriminator D compete with each 
other, continuously alternating the iterative optimization. Finally, they gradually 
reach an equilibrium. The optimization function of GAN is as follows: 


$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{23}$$


where D(x) represents the probability that the real sample x is identified as a real
sample after passing through the discriminator. G(z) represents the sample data
generated from random noise z by the generator. D(G(z)) represents the probability
of the generated sample data being judged as a real sample after passing through the
discriminator.

It can be seen that the optimization function of the GAN is equivalent to a min-max
optimization problem. This optimization has two steps. The first step is to optimize
the discriminator D, and the second step is to optimize the generator G. The objective
function can be regarded as alternately optimizing the following two objective
functions:


$$\max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{24}$$

$$\min_G V(D, G) = \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{25}$$


The label of the real sample is artificially defined as one, and the label of the
generated sample is zero. The purpose of the discriminator is to distinguish between
true and false samples. Thus, the discriminator hopes that D(x) is as close to one as
possible, and D(G(z)) is as close to zero as possible. Using sample data to optimize
the discriminator based on these two conditions is equivalent to maximizing V(D, G).
The purpose of the generator is to generate sufficiently realistic sample data; hence,
the generator hopes that D(G(z)) is as close to one as possible. For this purpose, using
sample data to optimize the generator is equivalent to minimizing V(D, G).

It should be noted that during training, whenever one of G and D is being updated, the
parameters of the other are kept fixed. Eventually, the distribution of the generated
samples G(z) and the real data distribution $P_{data}(x)$ become arbitrarily close, and
the generator can generate samples whose authenticity the discriminator cannot
distinguish.
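
The value function in Eq. (23) can be estimated from samples by replacing the two
expectations with averages. In the sketch below, the discriminator outputs are
hypothetical numbers rather than the outputs of a trained network, and the function
name is an assumption.

```python
import math

def value_function(d_real, d_fake):
    """Sample estimate of V(D, G) in Eq. (23).

    d_real holds D(x) for real samples; d_fake holds D(G(z)) for generated ones.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs.
confident_d = value_function([0.9, 0.8], [0.1, 0.2])  # D separates real from fake
fooled_d = value_function([0.5, 0.5], [0.5, 0.5])     # D cannot tell them apart
```

The discriminator update pushes the value up (toward `confident_d`), while the generator
update pushes it down. When the discriminator outputs 0.5 for every sample, the value
equals −log 4, which corresponds to the equilibrium described above.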


2.4.3 Typical Generative Adversarial Networks 


The conditional GAN [30] adds constraints to the standard GAN to guide the data
generation process, thereby generating controllable samples. It solves the problem
of GAN image generation being too free and difficult to control. The constraint
condition can be the category label or semantics of the image. The implementation
process of the conditional GAN is illustrated in Fig. 16.

The deep convolution generative adversarial network (DCGAN) [51] combines
CNN and GAN to improve the original GAN model at the level of the network structure.
DCGAN replaces the generator and discriminator in the original GAN with two
CNNs to improve the quality of sample generation and the speed of model convergence.


2.4.4 Application of Generative Adversarial Network 


A GAN can generate realistic samples without explicitly modeling any data
distribution in the process. Therefore, GANs have a wide range of applications in
many fields, such as images and text.

One function of the GAN is to generate data. A limitation on the development of
deep learning is the lack of training data, and GAN-generated data can help
compensate for this shortcoming. For example, given the text description of a bird,
such as some black and white on its head and wings and a long orange beak, a trained
GAN can generate images that match the description. Figure 17 shows the application
of GAN in image generation.

Fig. 17 Cross-modal image generation. The generative adversarial network can realize the task of
generating images from text

Another important application of GAN is image super-resolution, which refers
to the process of recovering high-resolution (HR) images from low-resolution (LR)
images. This is an important class of techniques in computer vision and image
processing. It enjoys a wide range of real-world applications, such as medical
imaging, surveillance, and security. Besides improving perceptual image quality, it
also helps to improve other computer vision tasks. However, this problem is
generally very challenging and inherently ill-posed, since multiple HR images always
correspond to a single LR image. As shown in Fig. 18, relying on its powerful image
generation capabilities, a GAN can encode LR images and decode them into HR images.

The task of image translation can also be achieved through GAN. Image translation
is the conversion of one (source domain) image to another (target domain) image.
During the translation, the content of the source domain image remains unchanged,
while the style or other attributes become the same as those of the target domain,
as shown in Fig. 19.

Fig. 19 Schematic diagram of image translation. This task merges two images to get a brand-new
image; the generated image retains the content of one input and the style of the other input

Image restoration is a technology that uses learned image information to complete
or repair a damaged image. Image restoration has various applications, such as
image completion and image deblurring. Owing to its good ability to fit the real
distribution, GAN has shown good results in image restoration.


2.4.5 Problems of Generative Adversarial Network 


GAN has become a popular research topic in recent years. Despite its recent genesis, it
has developed rapidly and has made important contributions in many fields. However,
owing to problems such as mode collapse and vanishing gradients, its generation
quality, training efficiency, and range of applications are still restricted.

(1) Low image-generation diversity 

The diversity of generated images has always been an important issue in GAN
research. Traditional GAN algorithms can only fit simple, small datasets, and the
complexity of the generated images is low. GAN algorithms have therefore been
developed further to improve image diversity. Existing GAN algorithms can generate
high-quality images that are indistinguishable from real ones; however, image
diversity remains restricted by many factors and often conflicts with other goals
such as image size and model complexity.

(2) Insufficient model training efficiency 

GAN training is unstable because of mode collapse and vanishing gradients. In
addition, the complex model structure and redundant information make the training
cycle too long.

(3) The application field has not been extensively studied 

GAN has been used in many fields in a relatively short period; nonetheless, it is
mostly limited to image processing. Many algorithms mention only the functions they
can achieve without explaining their practical value. Development has been slower
in other fields such as NLP.


3 Perceptual Understanding Based on Neural Network 


3.1 Recognition Based on Neural Network 


Various neural network architectures support a wide variety of perceptual
understanding applications. Currently, research on neural networks in natural
language processing, visual data processing, speech signal processing, and other
areas is progressing rapidly. Neural networks have been widely used in industrial
fields such as intelligent security, medical health, and industrial inspection.
Figure 20 briefly depicts the applications of neural network-based deep learning
research in some important fields. This section introduces neural network-based
recognition, segmentation, and prediction applications.


3.1.1 Problem Description 


Neural network-based recognition tasks involve both the extraction of features from 
the model input content and the establishment of mapping relationships between the 
extracted features and the identifiable attributes of the sample (category, location, 
etc.). The advent of CNNs has led to the rapid development of neural networks 
for visual recognition tasks. In this subsection, the classification, localization, and 
detection in recognition tasks are described. 

Figure 21 depicts the flow of a neural network-based recognition task. First, the
input image is passed through a CNN to extract key features, which are represented
as a one-dimensional feature vector. Then, this feature vector is input to the
classification module, localization module, or target detection module according to
the task, and the corresponding output is obtained. The classification module
determines the category of the target in the input image. The localization module
determines the location of the target in the image. The target detection module
combines classification and localization, that is, it determines the location of
the target in the input image and its corresponding category. The following
subsections describe each of these three recognition tasks and their corresponding
modules.

Fig. 20 Deep learning-based applications in computer vision. It briefly depicts applications of
neural network-based deep learning research in some important fields


3.1.2 Classification 


Image classification is a crucial task in computer vision. Its objective is to
determine which category the object in an image belongs to, such as whether it is a
cat or a dog. As shown in Fig. 22, depending on the task, image classification can
be subdivided into binary classification, multi-class classification, and
multi-label classification. The image classification task can be expressed by the
following equation.


$$C = f_{fc}[f_{conv+pool}(X)] \tag{26}$$

where $C$ denotes the category, $f_{conv+pool}$ denotes the convolutional and pooling
layers, $X$ denotes the input image, and $f_{fc}$ denotes the fully connected layer.

Fig. 22 Overview of a classification-based neural network. It includes binary classification,
multi-class classification, and multi-label classification

Binary classification is the most basic type of image classification. It
identifies whether the input image contains a target of a certain category, and the
result is represented by zero or one. A result of zero means that the input image
does not contain the target object, and a result of one means that it does.

Multi-class image classification is more widely used than binary classification.
Its purpose is to identify the specific category of the target in the input image,
where the image typically contains targets of only one category. Multi-class image
classification is now part of many aspects of daily life and has been used
effectively in a variety of sectors, for example in facial recognition.

Multi-label classification is used to identify all the categories present in the
input image. The images processed by this task often carry several different
labels, and these labels are not mutually exclusive. Multi-label classification can
describe the content of an image more completely and is closer to real-world
scenarios.
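
The difference between the three task types shows up in how the raw network outputs are turned into a decision. The following is a minimal sketch, assuming the common convention of a sigmoid for binary and multi-label outputs and a softmax for multi-class outputs; the text does not prescribe these functions, and the scores and the 0.5 threshold are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binary_decision(score, threshold=0.5):
    """Return 1 if the target is judged present, else 0."""
    return 1 if sigmoid(score) >= threshold else 0

def multiclass_decision(scores):
    """Return the index of the single most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)

def multilabel_decision(scores, threshold=0.5):
    """Return the indices of every class judged present."""
    return [i for i, s in enumerate(scores) if sigmoid(s) >= threshold]

scores = [2.0, -1.0, 0.5]  # hypothetical raw outputs for classes 0, 1, 2
print(binary_decision(scores[0]))
print(multiclass_decision(scores))
print(multilabel_decision(scores))
```

The multi-class decision returns exactly one class, whereas the multi-label decision may return several, because each label is thresholded independently.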

The success of deep learning classification tasks is inextricably linked to the
development of supervised learning. The construction of large-scale datasets and the
development of computational resources have made it possible to train the neural
network parameters. The loss function, a metric that measures how well the
predictions of a model match the true results, is an important component of the
classification task and plays an important role in updating the network parameters
through BP. Minimizing the loss function drives the model parameters toward better
prediction results. Taking the most commonly used loss, the cross-entropy loss
function, as an example, it can be written as follows.


$$Loss = -\sum_{i=1}^{n} y_i \log(p(x_i)) \tag{27}$$

where $n$ denotes the number of categories, $p(x_i)$ denotes the predicted probability
that sample $x$ belongs to the i-th category, and $y_i$ denotes the true label of
sample $x$ for the i-th category, which is one if $x$ belongs to the i-th category and
zero otherwise.

Fig. 23 Comparison of categories between traditional classification and fine-grained image
classification. Fine-grained image classification can be used to distinguish different types of
fighter jets
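
The cross-entropy loss in Eq. (27) can be evaluated directly for a single sample. The following is a minimal sketch, assuming a one-hot true label and a small clipping constant `eps` (an addition for numerical safety that is not part of the equation); the probability vectors are hypothetical.

```python
import math

def cross_entropy(y_true, probs, eps=1e-12):
    """Loss = -sum_i y_i * log(p_i) for one sample.

    y_true: one-hot list, 1 for the true category and 0 elsewhere
    probs:  predicted probability for each category
    eps:    lower bound on probabilities to avoid log(0)
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))

y = [0, 1, 0]  # the sample belongs to the second category
confident = cross_entropy(y, [0.05, 0.90, 0.05])
unsure = cross_entropy(y, [0.30, 0.40, 0.30])
print(confident, unsure)
```

Because the label is one-hot, only the probability assigned to the true category contributes, so the loss is smaller for the confident correct prediction than for the unsure one.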

For datasets such as ImageNet, which has more than 10 million images and 20,000
classes, computer image classification has surpassed the level of humans; however,
deep neural networks are less effective at recognizing subclasses within traditional
categories, for example discriminating between magpies and sparrows within the
category of birds. Furthermore, training the model requires a large number of
manually labeled samples, which is expensive. The cost of labeling increases
exponentially with the number of targets and the difficulty of telling them apart.
To address these two challenges, fine-grained image classification and unsupervised
image classification have emerged.

Fine-grained image classification builds on the distinction of basic categories
and further divides them into finer subcategories. It may be used to discriminate
between subcategories, such as various types of fighter airplanes, as illustrated in
Fig. 23, or between different automobile types and battleship models. It has a wide
range of practical uses.

Fine-grained images are more similar in appearance and features than coarse-grained
images. In addition, acquisition introduces the effects of pose, perspective,
illumination, occlusion, and background interference. As a result, the data has a
small inter-class variability and a large intra-class variability, which makes
classification more difficult.

The above classification tasks are achieved via supervised learning. Each sample
has a corresponding label, and the deep neural network continuously learns the
features corresponding to each label to achieve the classification. In this case,
the size of the dataset and the quality of the labels often play a decisive role in
the performance of the model. High-quality datasets are difficult to label and
require substantial human and financial resources. The aim of unsupervised image
classification is to classify samples into several classes without using label
information, thus greatly reducing the labor and time costs associated with data
labeling.


3.1.3 Localization 


The purpose of image localization is to determine the location of the target in the
input image. An image is input to the model, and the model outputs a rectangular box
enclosing the target, represented by the horizontal and vertical coordinates of the
box center and by the box width and height.

For example, a typical region proposal network (RPN) incorporates a binary
classification problem, that is, whether a given location contains an object or not.
Rectangular boxes of different sizes and aspect ratios are first generated at each
sliding-window position and labeled as positive or negative. The sample data for the
RPN are thus organized as a binary classification labeling problem: multiple
rectangular boxes within the input image, each labeled with the presence or absence
of an object. The RPN maps each sample to a probability value and four localization
values. The probability value reflects the probability that the rectangular box
contains an object, and the four localization values are used to regress the
horizontal and vertical center coordinates and the width and height of the target
object location.
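
The generation of rectangular boxes with different sizes and aspect ratios at one sliding-window position can be sketched as follows. This is an illustration under stated assumptions: each scale s fixes the box area at s*s and each ratio r is taken as width divided by height, which is one common convention; the scale and ratio values and the function names are hypothetical.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Boxes (cx, cy, w, h) centred on one sliding-window position.

    Each scale s fixes the box area at s * s; each ratio r = w / h
    fixes the shape of the box.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors

def to_corners(box):
    """Convert (cx, cy, w, h) to corner form (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

anchors = make_anchors(cx=64, cy=64, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
for a in anchors:
    print(to_corners(a))
```

Each box uses the same center, width, and height parameterization as the four localization values that the RPN regresses.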


3.1.4 Detection 


The image detection problem is equivalent to a combination of the localization and
classification problems. It needs to determine the location of the target and to
infer the class of the located target.

Current mainstream target detection algorithms can be divided into two-stage and
one-stage detection. The former frames detection as a “coarse to fine” process, while
the latter defines it as a “one-step completion”.

The two-stage detection algorithm is relatively slow but more accurate. Take the
most typical two-stage target detection algorithm, Faster R-CNN, as an example. As
shown in Fig. 24a, the first stage locates the targets in the input image through
the region proposal network. In the second stage, the localized targets are
classified; finally, the rectangular boxes of the localized targets and their
categories are obtained. The two stages of this model are executed in sequence,
which creates a stronger connection between localization and classification and
allows for more robust results. The single-stage detection algorithm is fast but
less accurate. Taking a typical one-stage target detection algorithm, YOLO v3, as an
example, YOLO directly generates both the box coordinates and the probabilities for
each category in a single pass using regression, as shown in Fig. 24b. This improves
the computational speed of the model but leads to poorer results because of the lack
of correlation between localization and classification.

Fig. 24 Two representative detection framework architecture diagrams. a Faster R-CNN, the most
typical two-stage target detection algorithm. b YOLO v3, a one-stage target detection algorithm

There are two main measures for the target detection task: IoU and confidence. The 
IoU was introduced in the previous section, and the other metric, confidence, is used 
to measure the confidence level of the detected targets. The higher the confidence 
level, the more certain the model is about the output. 
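
Both measures are simple to compute for axis-aligned boxes. The following is a minimal sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2) and that low-confidence detections are discarded with a fixed threshold; the detections and the 0.5 threshold are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_confident(detections, min_conf=0.5):
    """Drop detections whose confidence is below min_conf."""
    return [d for d in detections if d["conf"] >= min_conf]

truth = (0, 0, 10, 10)
detections = [
    {"box": (0, 0, 10, 10), "conf": 0.9},    # perfect overlap
    {"box": (5, 0, 15, 10), "conf": 0.6},    # half of the box overlaps
    {"box": (20, 20, 30, 30), "conf": 0.2},  # no overlap, low confidence
]
kept = keep_confident(detections)
ious = [box_iou(d["box"], truth) for d in kept]
print(ious)
```

Confidence decides which detections are reported at all, while IoU measures how well a reported box matches the ground-truth box.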

Object detection has both classification and localization capabilities and has
numerous applications. Examples include face detection, text detection, remote
sensing target detection, pedestrian detection, and automatic detection of traffic
signs and traffic signals. Target detection is widely used in military
reconnaissance, disaster rescue, and urban traffic management, and has gained wide
attention in the fields of autonomous driving, video surveillance, and criminal
investigation.


3.2 Segmentation Based on Neural Network 


Neural network-based segmentation is most often found in vision tasks. The follow- 
ing section introduces image segmentation as an example. 

Image segmentation is different from image classification and detection. The
task of image classification is to identify the content of an image, whereas the
task of image detection is to identify the content of an image and also determine
its location. Image segmentation is a pixel-level image classification task that
classifies each pixel of an image.

Image segmentation is a key operation in image processing. It refers to
representing a complete image by several disjoint regions, divided according to
characteristics of the image such as grayscale, luminance, and texture. It can
simplify the representation of the image. Features are similar or consistent within
the same region, while they differ clearly between regions.

Existing image segmentation is generally divided into semantic segmentation,
instance segmentation, and panoptic segmentation, as shown in Fig. 25.

Semantic segmentation assigns each pixel a class label without distinguishing
between instances of the same category. Instance segmentation is a combination of
object detection and semantic segmentation. It first detects the objects in the
image and then applies semantic segmentation to the detected objects, so that it
also distinguishes between different instances of the same kind. Panoptic
segmentation is a combination of semantic segmentation and instance segmentation. It
gives each pixel a class label while also distinguishing between different instances
of the same kind.

Fig. 25 Division of image segmentation tasks. According to the different segmentation tasks, image
segmentation can be divided into semantic segmentation, instance segmentation, and panoptic
segmentation

Fig. 26 Example of semantic segmentation based on neural networks. The image above represents
an application in city street view. The image below represents an application in a marine environment


3.2.2 Semantic Segmentation 


Semantics refers to the contents of an image. Semantic segmentation is performed
on a pixel-by-pixel basis to label the class based on the semantics of the image, as
shown in Fig. 26. Semantic segmentation uses feature extraction and classification
to mark pixels of cars as blue, pixels of trees as green, pixels of buildings as gray,
and so on.

There are three evaluation metrics for neural network-based semantic segmentation
algorithms: accuracy, execution time, and memory consumption. Execution time is an
important measure: in practical applications, datasets are generally very large and
computing hardware is limited, so only a short execution time can make image
segmentation practical in everyday applications. Memory consumption is also an
important factor affecting semantic segmentation, although memory is expandable in
most scenarios. Accuracy is the most critical metric for semantic segmentation.
Pixel accuracy and mean IoU are its common forms. Pixel accuracy is the number of
correctly classified pixels divided by the total number of pixels. IoU is the ratio
of the intersection to the union of two sets, which in semantic segmentation are the
ground-truth and predicted pixels of a class; the mean IoU averages this ratio over
all classes.
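
The two accuracy measures can be computed from a predicted label map and a ground-truth label map. The following is a minimal sketch, assuming both maps are small 2-D lists of integer class labels; the maps are hypothetical, and classes absent from both maps are skipped when averaging, which is one possible convention.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the true class."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)

def mean_iou(pred, truth, num_classes):
    """Average over classes of intersection size divided by union size."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

truth = [[0, 0, 1],
         [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]
print(pixel_accuracy(pred, truth))
print(mean_iou(pred, truth, num_classes=2))
```

In this example one of six pixels is wrong, so the pixel accuracy is 5/6, while the per-class IoU values are 2/3 and 3/4.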

Early neural network-based semantic segmentation models were built on
classification networks such as AlexNet, VGGNet, and ResNet. The later emergence of
the fully convolutional network (FCN) broke with previous segmentation methods and
substantially improved accuracy on the PASCAL VOC dataset. FCN has greatly promoted
the development of semantic segmentation algorithms. SegNet [1], RefineNet [32],
PSPNet [55], and DeepLab [6] were proposed one after another, and all of them have
achieved good results.


3.2.3 Instance Segmentation


The network architecture of semantic segmentation aims to optimize the accuracy
of segmentation results and improve segmentation efficiency, which should allow for
applications in real-time semantic image processing. However, semantic segmentation
can only judge categories and cannot distinguish individuals, so it cannot
accurately understand semantic information or parse scenes in many complex
real-world scenarios. The emergence of instance segmentation algorithms effectively
addresses this problem.

Fig. 27 Example of instance segmentation based on neural networks. The mask and label of each
object in the figure need to be segmented

The concept of instance segmentation was first proposed by Hariharan et al. [21], 
who aimed to detect objects in the input image and assign category labels to each 
pixel of the object. As shown in Fig. 27, unlike semantic segmentation, instance seg- 
mentation is able to distinguish between different instances with the same semantic 
category in the foreground. 

Instance segmentation is essentially a combination of semantic segmentation and
object detection. Like semantic segmentation, it classifies images at the pixel
level; like object detection, it locates different instances of the same category in
an image. As shown in Fig. 27, an instance segmentation technique based on deep
neural networks usually consists of three parts: the image input, the instance
segmentation model, and the segmentation result output. First, a deep network model
is designed according to the actual requirements, and the original image data is
input directly to the network to extract image features. After the high-level
abstract features are obtained, the instance segmentation model processes them. It
can either first determine the location and category of object instances via object
detection and then perform segmentation in the selected regions, or first carry out
semantic segmentation and then distinguish different instances. The final output is
an instance-segmented image with the same resolution as the input image.

In recent years, instance segmentation techniques have developed rapidly.
Mask R-CNN [22], which was developed based on the two-stage detector Faster
R-CNN [42], is a direct and effective instance segmentation method. It has become the
basic framework for some instance segmentation tasks because of its high accuracy
and stability. YOLACT [3], an instance segmentation algorithm extended from a
high-speed single-stage detector, achieves real-time segmentation of video and
processes it efficiently with a small loss of accuracy.


3.2.4 Panoptic Segmentation


Panoptic segmentation combines the tasks of semantic segmentation and instance
segmentation to generate a globally unified segmented image. Instance segmentation
only detects the objects in the image and segments the detected objects. Panoptic
segmentation detects and segments all objects in the image, including the
background, achieving a panoptic understanding of the image. Figure 28 shows an
example of panoptic segmentation. The information predicted by panoptic segmentation
is the most comprehensive: it includes not only the category of every pixel, as in
semantic segmentation, but also the distinction between different instances, as in
instance segmentation.

Compared to semantic segmentation, the difficulty of panoptic segmentation lies in
optimizing the design of the fully connected network so that its network structure
can distinguish between different categories of instances. The goal of panoptic
segmentation is to assign a semantic label and an instance ID to each pixel in the
image, where the semantic label refers to the category of the object and the
instance ID numbers the different objects of the same category. Because each pixel
receives exactly one label and one ID, the overlapping that can occur in instance
segmentation cannot occur in panoptic segmentation.

Fig. 28 Example of panoptic segmentation based on neural networks. The panoptic segmentation
task is a combination of the semantic segmentation task and the instance segmentation task

Fig. 29 The processing flow of panoptic segmentation

The basic process of panoptic segmentation is shown in Fig. 29; it is mainly
divided into feature extraction, semantic segmentation and instance segmentation
processing, and subtask fusion. The purpose of feature extraction is to obtain the
feature representation of the input image and provide the necessary information for
the two subsequent tasks. It relies on deep neural networks, and the main networks
used include VGGNet, ResNet, and MobileNet. The extracted features are shared by
semantic segmentation and instance segmentation. The semantic segmentation branch
produces semantic segmentation predictions, while the instance segmentation branch
produces instance segmentation predictions. Subtask fusion then processes and fuses
the prediction results of the two branches in an appropriate way to produce the
final panoptic prediction. Many works adopt this basic process, such as JSIS-Net [9],
AU-Net [31], Single network [10], and OANet [34].
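
The subtask fusion step can be illustrated with a deliberately simple heuristic. The following is a sketch under stated assumptions and is not the fusion method of any of the cited networks: instance masks are pasted onto the semantic map in order of decreasing confidence, and a pixel already claimed by one instance is not reassigned. The function name, labels, and confidences are hypothetical.

```python
def fuse_panoptic(semantic, instances):
    """Combine a semantic map with instance masks.

    semantic:  2-D list of class labels, one per pixel
    instances: list of (class_label, confidence, mask), mask a 2-D 0/1 list
    Returns a 2-D list of (class_label, instance_id); id 0 means no instance.
    """
    h, w = len(semantic), len(semantic[0])
    out = [[(semantic[r][c], 0) for c in range(w)] for r in range(h)]
    claimed = [[False] * w for _ in range(h)]
    # Higher-confidence instances claim contested pixels first.
    ranked = sorted(instances, key=lambda inst: inst[1], reverse=True)
    for inst_id, (label, _conf, mask) in enumerate(ranked, start=1):
        for r in range(h):
            for c in range(w):
                if mask[r][c] and not claimed[r][c]:
                    out[r][c] = (label, inst_id)
                    claimed[r][c] = True
    return out

SKY, CAR = 0, 1
semantic = [[SKY, CAR, CAR, CAR]]
instances = [
    (CAR, 0.8, [[0, 1, 1, 0]]),  # overlaps the next mask at the third pixel
    (CAR, 0.9, [[0, 0, 1, 1]]),
]
panoptic = fuse_panoptic(semantic, instances)
print(panoptic)
```

Every pixel ends up with exactly one semantic label and one instance ID, so the overlap between the two car masks in the input does not survive into the fused output.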

Image segmentation can convert images into more meaningful and analyzable
representations, which can effectively improve the processing efficiency of
subsequent vision tasks. It is the basis of image and scene understanding in
computer vision and plays an important role in many scenarios. In medical image
processing, by processing CT images of the organs of patients, it can accurately
locate the boundaries of lesions, automatically determine the location, shape, and
size of diseased regions, and assist doctors in lesion detection. In remote sensing
image processing, it can efficiently survey and plan geographic spatial information
such as topography and landforms, the direction of water systems, urban
distribution, and farmland planning. In autonomous driving, it can assess the
surroundings of the road from real-time road scenes, including lane line direction,
traffic signs, and the safe position of oncoming pedestrians or vehicles, in order
to provide correct guidance to vehicles and ensure driving safety. In intelligent
security, objects in surveillance video are located and screened for security
warning or object tracking. Furthermore, image segmentation can also be applied to
augmented reality, text extraction, and industrial sorting.

With the improvement of computer performance and the continuous optimization 
of image segmentation algorithm architecture, image segmentation technology based 
on deep neural networks has become an important task. While pushing the network 
in the direction of being lightweight, real-time, and highly accurate, more attention 
should be paid to technology implementation and scene promotion. 


3.3 Prediction Based on Neural Network 


This section introduces the application of neural networks to prediction problems. We 
will expand on regression, time-series prediction, one-dimensional signal prediction, 
and two-dimensional video prediction. 


3.3.1 Regression 


Regression is used to predict the relationship between independent and dependent
variables [8]. Learning a regression problem is similar to function fitting: select
a function curve that fits the known data and accurately predicts the unknown data.

According to the number of independent variables, regression problems are divided
into unary (single-variable) and multiple regression. According to the relationship
between the independent and dependent variables, they are divided into linear and
nonlinear regression.

The square loss function is the most often used loss function for regression learn- 
ing. The least-squares approach may be used to tackle the regression problem in this 
scenario. 
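
For unary linear regression, minimizing the square loss has a closed-form least-squares solution. The following is a minimal sketch, assuming a straight-line model y = a x + b; the data points and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the model y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def square_loss(xs, ys, a, b):
    """Sum of squared differences between the data and the fitted line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
a, b = fit_line(xs, ys)
print(a, b, square_loss(xs, ys, a, b))
```

Any other slope or intercept gives a larger square loss on these points, which is what makes the fitted line the least-squares solution.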

Tasks in many fields can be formalized as regression problems. For example, 
regression can be used in the business field as a tool for market trend forecast- 
ing, product quality management, customer satisfaction surveys, and investment risk 
analysis. 


3.3.2 Time Series Prediction 


A time series is a sequence of numbers arranged in a specific order, and this order is 
usually determined by time [53]. It is an important means for people to understand 
the objective world and natural phenomena. 

The development of time-series prediction can be divided into two periods. The
early stage was before World War II, when financial and economic forecasting was the
focus. The second stage ran from the middle of the war to the 21st century. In this
period, the application areas became more extensive, including meteorology,
aerospace, electronic computing, and mechanical vibration. Time-series prediction
has become an active field of academic research.

The rapid development of artificial intelligence has a significant impact on time- 
series prediction methods. Currently, commonly used time-series prediction methods 
can be divided into traditional methods and methods based on deep learning. 

Traditional time-series prediction methods usually perform poorly on time series
that are not wide-sense stationary. Moreover, they are limited in forecasting
complex and highly nonlinear time series.

Neural networks help to address these problems. They have good learning
capabilities and can learn the underlying laws of a time series from the data itself
through multiple iterations. Compared to traditional methods, neural network methods
are often more accurate and can be applied to most time series. Among them, RNNs
usually perform better when dealing with time-series problems.

3.3.3 One-dimensional Signal Prediction 


Neural networks are also widely used in the prediction of one-dimensional signals. 

In recent years, some scholars have discovered that high-frequency oscillation
signals are correlated with particular diseases, which may help to improve the
accuracy of lesion localization and the success rate of clinical operations. In
addition, these findings and applications can help us understand the
pathophysiological mechanism of the electrical activity of the human brain and
explore preventive treatments informed by disease prediction.


3.3.4 Two-dimensional Video Prediction


Video prediction technology predicts subsequent video frames when a sequence of
consecutive video frames is provided [55]. It is an important topic in the field of
computer vision and has significant application prospects.

Fig. 30 Two-dimensional time series prediction. The typhoon positions at time 1 to time n are
known, and the typhoon position at time n + 1 is predicted


For example, in autonomous driving tasks, researchers can use the image
information of historical frames to analyze the trajectories of pedestrians and
vehicles outside the car. In this way, the computer can predict the location of
objects outside the vehicle and make judgments in advance, which helps to avoid
traffic accidents and improves the safety of autonomous driving.

In oceanography, the forecast of meteorological elements and ocean elements in 
a certain sea area is similar to the forecast of video sequences. The gridded two- 
dimensional sea area corresponds to a frame in the video, and the change of the sea 
area in a certain period corresponds to the change of the video frame on the timeline. 
Fig. 30 depicts the neural network prediction of a typical 
typhoon path in the seas of eastern China. The neural network encodes the offshore 
wind field in the eastern China seas at each known moment and predicts the 
future evolution of the wind field through the RNN. 
1 Sea Surface Temperature and Tropical Instability Waves 


With the development of earth observation satellites and various active and passive 
sensors, massive ocean data have been acquired. For instance, the cumulative satel- 
lite data archive volume at the National Oceanic and Atmospheric Administration’s 
National Centers for Environmental Information reached ~7.5 petabytes in 2016. 
The projected volume by 2030 is ~50 petabytes [32]. Many oceanic gridded prod- 
ucts (e.g., sea surface temperature (SST), sea surface winds, and sea surface height) 
have been generated from such deluges of satellite data. These products provide an 
unprecedented golden opportunity for in-depth research and demonstrate the urgent 
need to develop effective methods to explore time-series data. SST can be measured 
from space and has the longest history among satellite-derived oceanic products; it is 
widely used to reveal the evolution of various important oceanic phenomena such 
as El Niño, western boundary currents, and tropical instability waves (TIWs) [18]. 
Thus, SST is a critical parameter in understanding physical oceanography, biological 
oceanography, and atmosphere-ocean interaction; it is also a key input parameter 
for climate and weather modeling. The models in traditional statistical analysis have 
relatively limited complexity, so they may not work well when used 
to model oceanic phenomena that are complicated by nature. 

Recently, a new research and application front has emerged that exploits the vast 
amount of available data using deep learning (DL) technology. With DL, substantially 
more complex models can be built to mine rules deeply hidden in SST data. DL is 
a subset of machine learning that teaches computers to learn and make decisions or 
predictions based on input data. The deep neural network (DNN) technique is one 
of the most popular and powerful DL techniques, achieving successes in computer 
vision and speech recognition [15, 17]. A DNN is a multilayer neural network (NN). 
In most network layers of a DNN, input values are weighted, combined, and then 
transformed by an activation function to incorporate nonlinearity into the network. 
The output values of a network layer are linked to the next layer as input. All weights 
of a DNN are iteratively optimized by combining error backpropagation and gradient- 
based optimization to make the DNN suitable for finding the underlying relationship 
among its inputs and outputs. Such a multilayer structure allows the DNN to learn 
data features with multiple abstraction levels, beyond what the human brain can 
readily conceive [15]. Convolutional layers, named for their mathematical form, are a 
core type of network layer widely used in DNN models. In a convolutional layer, the 
output value at a specific site is calculated by weighting and combining the nearby 
sites’ input values. Each output site shares the same weights. Thus, a convolutional 
layer has fewer weights to be optimized than a traditional fully connected layer that 
uses independent weights to connect all input and output sites. As a result, using 
the convolutional layer is particularly efficient in processing multi-dimensional data. 
Therefore, compared with traditional statistical models, DNN-based DL models can 
be much more complex and thus, after being trained on a large quantity of sample data, can 
more efficiently learn the inherent characteristics behind them. Recently, DL appli- 
cations in the prediction of future images in videos have drawn extensive attention 
in the field of computer vision [24, 35]. Ocean SST forecasting is similar to image 
prediction in videos, where future SST maps are forecasted based on the previous 
maps using a DL model. Because of the abovementioned similarities, we believe 
DL technology will help us to model oceanic phenomena in a different and promis- 
ing way that is driven by ever-increasing big ocean data, although DL applications 
in oceanography and other geosciences have only begun in recent years [31]. Therefore, 
using the large accumulated volume and long time series of satellite SST data, we 
can build a purely data-driven SST forecasting model that captures the spatial-temporal 
variations of a complicated yet important oceanic phenomenon, the TIW, which affects 
the transport of heat, mass, and momentum in the ocean, air-sea and biophysical 
interactions, climate change, etc. As an internally generated ocean variability with 
time scales of approximately 15-40 days, TIWs produce large perturbations to physical 
and biological fields in the ocean, including SST. Furthermore, TIW-produced 
SST perturbations induce almost instantaneous atmospheric surface wind responses, 
forming TIW-scale interactions between the atmosphere and ocean. Although TIWs 
are dominantly controlled by the background ocean state, TIW evolution and 
predictability are affected by air-sea coupling at TIW scales. TIW forecasting is a 
challenging task because the spatial-temporal variation of TIWs is significant, with large 
shape distortions and deformations and with seasonal and interannual variability caused 
by the El Niño-Southern Oscillation. Dynamical equation-based numerical modeling 
of TIWs requires both a high-resolution spatial discretization grid and realistic 
parameterizations of the relevant physical processes. As a result, substantial 
difficulties exist in realistically simulating TIW-related oceanic and atmospheric 
responses and the coupled air-sea interactions. Therefore, the data-driven model was 
applied to the SST field in the eastern equatorial Pacific Ocean to show that TIW 
propagation can be forecasted by such a model. 

Satellite-derived SSTs have long been assimilated into numerical models to 
improve their forecasts. Recently, an NN-based strategy was proposed to play 
a role similar to that of data assimilation. For example, in [27], an NN model is used to 
find the bias correction term in a numerical SST forecasting model. Compared with 
a numerical model, a data-driven forecasting model is much simpler and more 
computationally efficient. The forecast made by a data-driven model relies only on prior 
data of a few physical parameters or even a single parameter. As another example, an 
SST pattern time series can be expanded as the sum of products of time-dependent 
principal components and the corresponding space-dependent eigenvectors by 
empirical orthogonal function (EOF) analysis. Thus, the forecast of the SSTs at 
grids can be approximately reduced to the forecasts of several leading SST principal 
components [40]. Recently, NN models were developed to directly forecast SSTs 
without EOF approximations, including both site-specific and site-independent models. 
A site-specific model accounts for the differences among sites and therefore makes SST 
forecasts with different NN models at different sites [26]. However, because an NN model 
must be built for each site, the computational cost of the NN-training phase of a 
site-specific model is high, and sufficient NN-training samples are also required at each 
site. In a site-independent model, different sites share the same SST forecast 
model [2, 42, 44]. This makes site-independent models more efficient. However, 
when forecasting a future SST at one site, these recent models only utilize the prior 
SST series at the very close neighboring sites. The models may have limitations over 
a large area because the SST patterns controlled by large-scale phenomena could 
be related to each other within a vast ocean area. Thus, an SST series covering a wider 
area centered at the forecast site should arguably be utilized to forecast the future SST. 
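
The EOF expansion mentioned above can be illustrated with a minimal sketch. It assumes the SST series is available as a gap-free NumPy array of shape (time, lat, lon) and computes the EOFs through a singular value decomposition of the anomaly matrix; the function names `eof_decompose` and `eof_reconstruct` are hypothetical and are not taken from the cited studies. 

```python
import numpy as np

def eof_decompose(sst, n_modes):
    """Split an SST series (time, lat, lon) into a time mean, principal
    components (time, n_modes) and spatial EOF patterns (n_modes, lat, lon)."""
    t, ny, nx = sst.shape
    flat = sst.reshape(t, ny * nx)
    mean = flat.mean(axis=0)
    anomalies = flat - mean
    # Rows of vt are the space-dependent eigenvectors (EOFs);
    # u * s gives the time-dependent principal components.
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    pcs = u[:, :n_modes] * s[:n_modes]
    eofs = vt[:n_modes].reshape(n_modes, ny, nx)
    return mean.reshape(ny, nx), pcs, eofs

def eof_reconstruct(mean, pcs, eofs):
    """Rebuild SST maps as mean + sum over modes of PC(t) * EOF(lat, lon)."""
    return mean + np.tensordot(pcs, eofs, axes=(1, 0))
```

With only a few leading modes retained, forecasting the SST maps reduces to forecasting the columns of `pcs`, at the price of the truncation error that the direct NN models avoid. 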

In the following section, we introduce a multi-scale DNN with four 
stacked composite layers for SST forecasting in the eastern equatorial Pacific Ocean, 
which overcomes the shortcomings of previous data-driven SST forecasting models. 
The idea of a multi-scale scheme has achieved notable successes in the field of 
computer vision, e.g., DNN applications in semantic segmentation [21, 33], but has not 
been explored in the oceanography field. Considering the natural differences among 
different sites, we also build a space-dependent but time-independent bias correction 
map and then combine it with the multi-scale DNN to develop the final data-driven 
SST forecasting model, named the DL model for brevity. 

The developed DL model was applied to forecast the SST pattern variations asso- 
ciated with the TIWs in the eastern equatorial Pacific Ocean. TIWs are an important 
ocean dynamic phenomenon in both the equatorial Pacific and Atlantic Oceans. They 
were first captured in the current meter records and infrared satellite images in the 
1970s [6, 18]. One prominent characteristic of Pacific TIWs is their cusp-shaped, 
westward-propagating waves on both flanks of the equatorial Pacific cold tongue, 
with the stronger signal on the north flank. Previous studies have estimated the 
wavelength, period, and phase speed of TIWs from various data sources, and their values 
are typically within the ranges of 600 to 2000 km, 15 to 40 days, and 17 to 86 km/day 
[3, 4, 12, 13, 19, 28, 29, 38, 39]. Previous studies also suggested that the generation 
of TIWs could be the result of barotropic and baroclinic instability processes of the 
meridional and vertical shear among the westward South Equatorial Current, the 
eastward Equatorial Undercurrent, and the North Equatorial Counter Current [4, 23, 
30, 34]. As a result, TIWs are inactive during boreal spring and active during boreal 
fall, because the current shear is weaker in spring and stronger in fall. Moreover, TIWs 
are suppressed and even indiscernible during strong El Niño years, when the Pacific cold 
tongue and the related equatorial current shear are too weak, and vice versa during La 
Niña years [39]. Conversely, TIWs also feed back on the El Niño-Southern Oscillation, 
affecting its asymmetry and irregularity [1, 10, 11]. The physical and biological processes of 
TIWs are complicated. As has been widely illustrated, TIWs have a profound effect 
on the distribution of SST, sea surface height anomaly, chlorophyll-a, rain, salinity, 
and winds in the eastern equatorial Pacific Ocean [3, 14, 28, 29, 38]. TIWs induce 
horizontal convection and vertical mixing in the upper ocean [12, 13, 20, 25]. The 
mixing reaches even the lower half of the thermocline, a fact that is still not well 
considered in most physical models [20]. TIWs affect the equatorial chlorophyll-a 
concentration by transporting nutrients to the upper ocean [7, 9, 43]. Conversely, 
modeling analyses indicate that chlorophyll-a may modulate solar radiation in the 
upper ocean and weaken TIWs [36, 37]. TIWs also interact with the atmosphere 
because of the sea surface wind modulation caused by the TIW-induced SST 
anomalies [21, 41, 45-47]. Moreover, a spatial correlation between SST and cloud patterns 
is observed during the TIW seasons. The clouds appearing in the warm troughs of 
the TIWs are usually generated by cool low-level winds crossing the SST fronts 
and, in turn, dampen the TIW-induced SST anomalies by reducing the incident solar 
radiation over the warm troughs [5]. More comprehensive physical models for TIW 
studies are still under development, and many of the above-mentioned aspects should be 
considered to make the models more realistic [12, 14, 20, 36, 37, 45-47], which is a 
difficult challenge. In contrast, the time series of data reflect all these factors. Owing 
to its strong data-mining ability, a data-driven DL model can automatically learn 
comprehensive rules of SST spatial-temporal variations from the data without having to 
describe the various complex processes with physical equations. 
2 Data and Model of SST Forecasting 


There are two parts in the model: a DNN and a constant map. The DNN is multi- 
scale, having a network structure of four stacked composite layers for different spatial 
resolutions. The DNN uses the SSTs from the preceding fourteen steps to estimate 
the SSTs at the following step. The interval between two consecutive steps is five days. 
The DNN-made estimation is followed by the correction with the constant map for 
reducing bias. The details are given below. 


2.1 Satellite Remote Sensing SST Data 


The DL model was built and tested with the SST products of Remote Sensing Systems. 
The products were made from both microwave and infrared sensor measurements. 
Our study area is a rectangular region spanning from 120°W to 180°W in 
longitude and from 10°S to 10°N in latitude. The products from 2006 to 2019 were 
collected for our study. These 9-km-grid products were averaged to 18-km-grid 
SST data. The SST data were divided into two parts according to time. The first part 
(1 Jan 2006 to 31 Dec 2009) and the second part (1 Jan 2010 to 31 Mar 2019) 
were used to build and test the DL model, respectively. Considering that TIWs 
have a temporal scale of about fifteen to forty days, the time step of the DL model is 
set to five days. Based on the SST maps at the current step and the preceding thirteen 
steps, the DL model forecasts the SST map at the following time step, the fifteenth step. 
Therefore, a sample in our study is an SST series consisting of fifteen sequential SST 
maps. The SST series was then shifted day by day to obtain the second, third, fourth, 
etc. series. The DL model forecasts the fifteenth-step SST map in each series based on the 
first fourteen SST maps. The forecasted SST map was then validated against the 
fifteenth-step SST map of that series. Approximately one thousand four hundred series 
were generated in the first part of the SST data, and three thousand four hundred 
series samples were generated in the second part of the SST data. It should be noted 
that a significant El Niño event occurred during the period of 2014-2016, which is 
covered by the second part of the SST data. 
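
The sampling scheme described above (fifteen maps spaced five days apart, with the series start shifted day by day) can be sketched as follows. This is only an illustration under the assumption that one SST map is available for every calendar day without gaps; the function names are hypothetical. 

```python
from datetime import date, timedelta

STEP_DAYS = 5          # time step of the DL model
MAPS_PER_SERIES = 15   # 14 input maps + 1 target map

def series_start_dates(first_day, last_day):
    """All start dates for which a full 15-map series fits in [first_day, last_day]."""
    span = timedelta(days=STEP_DAYS * (MAPS_PER_SERIES - 1))
    starts = []
    day = first_day
    while day + span <= last_day:
        starts.append(day)
        day += timedelta(days=1)  # shift the series day by day
    return starts

def series_dates(start):
    """Dates of the 15 maps in one series; the last date is the forecast target."""
    return [start + timedelta(days=STEP_DAYS * i) for i in range(MAPS_PER_SERIES)]
```

Under these assumptions, the 2006-2009 period yields 1391 series, which is consistent with the roughly one thousand four hundred series quoted for the first part of the data. 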


2.2 Architecture and Training of the DL Model 


As shown in Fig. 1, the DL model is composed of a trained multi-scale DNN and a 
time-independent bias-correction map. The DNN is a stack of four composite layers, 
and each composite layer has four cascaded convolutional layers. In this region, the 
SSTs range from 16°C to 34°C, and this range was rescaled to [-1, 1]. To feed 
the SST maps to the corresponding composite layers at different stack levels, a 2 x 2 
average pooling operation was used to down-sample the SST maps. These composite 
layers process the SST maps at different spatial resolutions. The lower the stack level, 
the higher the resolution. Except for the top level, each higher-resolution composite layer 
at a lower stack level requires the output of the composite layer at the stack level above 
it, and this output needs to be up-sampled. The input of the DNN consists of 14 SST 
maps at the current step and the previous 13 steps. Considering that the input SST map 
at the current time step is more correlated with the future SST map, the DL model also 
directly links the input SST map at the current step to the last convolutional layer, 
along with the up-sampled output of the lower-resolution composite layer at the upper 
stack level. The rectified linear unit function has better error gradient propagation 
[8], so it was used as the activation for the first three convolutional layers of each 
composite layer. The tanh activation was used for the last convolutional layer of each 
composite layer except for the bottom composite layer. The tanh activation rescales 
the output of each composite layer to [-1, 1], which matches the input range of the 
higher-resolution composite layer to which the output is fed after up-sampling. The 
activation of the last convolutional layer of the bottom composite layer is a linear 
function, which leaves the DNN output unbounded. The four convolutional 
layers of each composite layer have 8, 16, 32, and 1 channels, respectively. The kernel 
sizes of the four convolutional layers of the top composite layer are all 3 x 3. Those of the 
other composite layers are 5 x 5, 3 x 3, 3 x 3, and 5 x 5, respectively. 

Fig. 1 The DL model receives SST maps at the previous and current time steps and then outputs 
the SST map at the future time step. The major part of the DL model is a DNN having four stacked 
composite layers. The bias correction map is added to the DNN output to obtain the forecasted SST 
map 

For a general network layer, one site in the output map is connected to multiple 
sites in the input map. Thus, the value at the output site depends only on the 
values at these input sites rather than on the whole input map. These input sites form 
the receptive field of the output site. For instance, the input sites inside a receptive 
field of a convolutional layer are weighted and connected to the corresponding output 
site by the convolution kernel. The receptive field can be enlarged by using average 
pooling layers to down-sample the inputs before feeding them to the subsequent 
layer. Then, the output can be passed through the same number of up-sampling layers 
to restore the resolution. SST variations at different locations may be correlated through 
large-scale oceanic phenomena. Considering this, we use the SST series of a 
wider area to forecast the SST at the area center. Therefore, the DNN is designed to be 
multi-scale to obtain a wider receptive field. After three down- and up-samplings 
among the four composite layers, the receptive field of the whole DNN is enlarged 
about twelvefold. For forecasting TIWs, this size is large enough. 
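
The effect of pooling on the receptive field can be made concrete with a small arithmetic sketch. It assumes stride-1 convolutions and non-overlapping 2 x 2 pooling; the helper names are hypothetical. 

```python
def stack_receptive_field(kernel_sizes):
    """Receptive field width (in input pixels) of stride-1 convolutions in sequence."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each k x k convolution widens the field by k - 1 pixels
    return rf

def pooled_footprint(rf_pixels, n_poolings, pool=2):
    """Width, in original-resolution pixels, spanned by rf_pixels coarse pixels
    after n_poolings rounds of non-overlapping pooling."""
    return rf_pixels * pool ** n_poolings

bottom_rf = stack_receptive_field([5, 3, 3, 5])   # 13 pixels at full resolution
top_rf = stack_receptive_field([3, 3, 3, 3])      # 9 pixels at the coarsest resolution
top_footprint = pooled_footprint(top_rf, 3)       # 72 pixels at full resolution
```

A single composite layer at full resolution therefore sees only 13 grid cells across, whereas the top composite layer alone already spans 72 full-resolution cells. The overall enlargement quoted in the text also depends on how the outputs of the four levels are chained, which this sketch does not attempt to reproduce. 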

The SST-map-series samples for building the DNN were divided into training 
and validation datasets at a ratio of 3:1. The input area is set to be larger 
than the output (forecast) area to ensure that the input area covers the whole 
DNN receptive field. The following loss function is used to optimize the DNN: 


$$\mathrm{Loss} = \sum_{k=1}^{K} \sum_{(m,n) \in \mathrm{Grids}_{\mathrm{output}}} \left( \mathrm{SST}^{\mathrm{SAT}}_{k}(m,n) - \mathrm{SST}^{\mathrm{DNN}}_{k}(m,n) \right)^{2} \qquad (1)$$

where $\mathrm{SST}^{\mathrm{SAT}}_{k}(m,n)$ is the fifteenth-step satellite SST map, k denotes the kth sample, 
and K is the sample number of the training or validation dataset. (m, n) denotes the 
grid (m, n) of the output area, and $\mathrm{Grids}_{\mathrm{output}}$ is the grid set. $\mathrm{SST}^{\mathrm{DNN}}_{k}(m,n)$ is the 
DNN-forecasted SST. The Adam algorithm [16] was used to optimize the DNN 
parameters on the training dataset, and the maximum number of epochs was set to 
2500. The optimization was implemented using CUDA on an NVIDIA 
Quadro M4000 graphics card with eight GB of memory. To avoid 
overfitting to the training dataset, the loss on the validation dataset was also 
calculated during the optimization procedure. The smallest validation loss 
was achieved at the one hundred and twenty-ninth epoch, after about one 
hundred and fourteen minutes. The parameter values corresponding to the smallest 
validation loss were adopted. 
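
The loss in Eq. (1) and the selection of the epoch with the smallest validation loss can be sketched as follows. The array layout (samples, rows, columns) and the function names are assumptions made for illustration; the training loop itself is omitted. 

```python
import numpy as np

def sse_loss(forecast, satellite):
    """Sum of squared errors over all samples and all output grids, as in Eq. (1).
    Both arrays are assumed to have shape (K, M, N)."""
    return float(np.sum((satellite - forecast) ** 2))

def best_epoch(validation_losses):
    """1-based index of the epoch with the smallest validation loss;
    the parameters saved at that epoch would be the ones adopted."""
    return int(np.argmin(validation_losses)) + 1
```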

Parameters in convolutional layers are the same for different sites. In addition, 
neither the average pooling layers nor the up-sampling layers have optimizable parameters. 
Thus, the DNN is independent of the site. However, the environmental background 
of the study area is inhomogeneous. There is a spatial trend in which the SST is overall 
higher in the west than in the east. This may cause the SST patterns in different areas 
to evolve differently. Therefore, an SST correction map is included in the 
DL model, which is added to the DNN-forecasted SST map to make the final forecast 
(Fig. 1). Using the samples from the training period, this SST correction map is 
generated by calculating the bias of the DNN at each grid after the optimization. 
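
One way to build such a correction map is sketched below. The sign convention (mean of satellite minus DNN forecast, so that adding the map cancels the mean error) is an assumption chosen to match the statement that the map is added to the DNN output; the function names are hypothetical. 

```python
import numpy as np

def bias_correction_map(dnn_forecasts, satellite_targets):
    """Time-independent, site-dependent correction map.
    Both inputs have shape (samples, rows, cols) and come from the training period."""
    # Average over samples only, so one correction value remains per grid.
    return np.mean(satellite_targets - dnn_forecasts, axis=0)

def corrected_forecast(dnn_forecast, correction_map):
    """Final DL forecast: DNN output plus the correction map."""
    return dnn_forecast + correction_map
```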

The operating efficiency of the developed DL model is very high. It only takes 
about 1 minute to forecast SSTs for all testing samples on an ordinary desktop 
computer. 


3 SST Forecast of TIW Motion Using the DL Model During 
the Testing Period (2010/01-2019/03) 


Figure 2a-c shows the satellite SST maps of the testing period, and Fig. 2d-f shows 
the SST maps forecasted by the DL model. The maps match closely in shape; 
the most notable feature is the westward-propagating TIWs, which are 
cusp-shaped and deform irregularly. 




Fig. 2 Satellite SST maps a to c and DL-forecasted SST maps d to f 

Fig. 3 The outputs at three consecutive time steps (the same as the steps in Fig. 2) of the fourth- 
(top)-stack-level composite layer of the DNN a to c, the 
third-stack-level composite layer d to f, the second-stack-level composite layer g to i, and the 
first-(bottom)-stack-level composite layer j to l 


Figure 3 shows the outputs of the four composite layers in Fig. 1 at three consecutive 
time steps, visualized for the first (bottom) to the fourth (top) stack levels of 
the DNN. For the sake of clarity, the coarse-resolution results at higher levels are 
converted to the initial resolution using nearest neighbor interpolation. 
The results are then rescaled to [-1, 1]. All outputs show a westward-propagating 
signal similar to that in the satellite SST maps shown in Fig. 2a-c. These maps are 
extracted from the DNN during the training period (2006-2009) and show 
the temporal and spatial characteristics of TIWs. The related parameters in the network 
are learned by the DNN from sample data. The TIWs’ motion can be forecasted from these 
features. 

The meridional averages (MAs) of the forecasted and satellite SST maps are calculated. 
The westward propagation speed of the SST pattern can be estimated from the zonal lag 
that maximizes the detrended cross-correlation, along the equator, between the MAs at 
the current time step and those at the next step. 

Fig. 4 Procedure for estimating the zonal speed of the SST pattern: a Two zonal sequences of SST MAs at the 
longitudes of the grids are calculated from the SST maps at two times, where blue and red denote 
the first time and the second time, respectively. The first sequence is of satellite MAs, and the second 
sequence is forecasted by the DL model. b The two sequences of SST MAs after their linear trends 
are removed. c Cross-correlations of the two sequences after the linear trends are removed. d An 
enlarged image of the green box in Fig. 4c, where the three green points are the maximum discrete 
cross-correlation and the two cross-correlations at the neighboring discrete zonal lags. e The three 
green points can be interpolated with a quadratic curve (black line), and the zonal lag corresponding 
to the peak of the curve is considered as the exact zonal lag with the maximum cross-correlation. 
The speed can then be estimated by dividing the exact zonal lag by the time interval 


During TIW seasons, the SST MAs reflect the westward-propagating 
signal of the SST pattern. The forecast area has an approximately linear zonal 
trend of SST, which is warm in the western part and cold in the eastern part, and 
this trend is superimposed on the above signal. An instance of two zonal sequences 
of SST MAs at the longitudes of the grids of the forecast area and at two consecutive 
time steps is given in Fig. 4a. The red line represents the MAs of the DL-forecasted 
SST map five days (one time step) later, and the blue line represents the MAs of 
the satellite SST map. The westward propagation of the signal becomes 
more obvious after the linear zonal trend of the SST MAs is removed (Fig. 4b). The 
cross-correlations of the two detrended SST MA sequences are calculated 
at the discrete zonal lags (Fig. 4c), and the discrete lag with the maximum 
cross-correlation and its two neighboring discrete lags are found (Fig. 4d). A quadratic 
curve is fitted through the cross-correlations at these three discrete lags. The peak lag 
of the interpolated curve is taken as the exact lag of the maximum cross-correlation 
between the two detrended SST MA sequences (Fig. 4e). In mathematical form, this 
is 


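$$\mathrm{lag}_{\mathrm{exact}} = \frac{1}{2}\,\frac{y_1(\mathrm{lag}_2^2 - \mathrm{lag}_3^2) + y_2(\mathrm{lag}_3^2 - \mathrm{lag}_1^2) + y_3(\mathrm{lag}_1^2 - \mathrm{lag}_2^2)}{y_1(\mathrm{lag}_2 - \mathrm{lag}_3) + y_2(\mathrm{lag}_3 - \mathrm{lag}_1) + y_3(\mathrm{lag}_1 - \mathrm{lag}_2)} \qquad (2)$$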
where $\mathrm{lag}_1$, $\mathrm{lag}_2$, and $\mathrm{lag}_3$ are the three discrete lags, and $y_1$, $y_2$, and $y_3$ are the 
corresponding cross-correlations. Finally, the propagation speed is obtained by 
dividing the exact lag by the time interval. 

Fig. 5 Temporal variation of the SST pattern associated with TIW westward propagation during the 
testing period. We calculated the MAs of the satellite and forecasted SST maps and then estimated the 
speed of the SST pattern westward propagation based on the maximum detrended cross-correlation 
along the equator between the MAs of the satellite SST map at the current time step and those of 
the satellite or predicted SST map (brown dashed curve: the speed calculated by satellite/satellite 
pairs, green solid curve: the speed calculated by satellite/DL-predicted pairs, orange dotted curve: 
the daily Niño3.4 index) 
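
The speed-estimation procedure (linear detrending, cross-correlation at discrete lags, and quadratic interpolation of the peak) can be sketched in NumPy. The sketch makes several assumptions that are not stated in the text: the MA sequences are indexed from west to east, a positive lag means a westward shift, the zonal grid spacing is 18 km, and lags are searched within a hypothetical window of ±40 grid cells. 

```python
import numpy as np

def detrend_linear(y):
    """Remove the least-squares linear trend from a 1-D sequence."""
    x = np.arange(y.size)
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

def lagged_corr(a, b, lag):
    """Correlation between a[i + lag] and b[i] over the overlapping part."""
    n = a.size
    if lag >= 0:
        x, y = a[lag:], b[:n - lag]
    else:
        x, y = a[:n + lag], b[-lag:]
    return np.corrcoef(x, y)[0, 1]

def quadratic_peak(lags, ys):
    """Peak position of the parabola through three (lag, correlation) points."""
    l1, l2, l3 = lags
    y1, y2, y3 = ys
    num = y1 * (l2**2 - l3**2) + y2 * (l3**2 - l1**2) + y3 * (l1**2 - l2**2)
    den = y1 * (l2 - l3) + y2 * (l3 - l1) + y3 * (l1 - l2)
    return 0.5 * num / den

def westward_speed(ma_now, ma_next, grid_km=18.0, dt_days=5.0, max_lag=40):
    """Westward speed (km/day) of the pattern between two MA sequences."""
    a = detrend_linear(np.asarray(ma_now, dtype=float))
    b = detrend_linear(np.asarray(ma_next, dtype=float))
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.array([lagged_corr(a, b, int(lag)) for lag in lags])
    k = int(np.argmax(corr))
    if 0 < k < lags.size - 1:
        lag_exact = quadratic_peak(lags[k - 1:k + 2], corr[k - 1:k + 2])
    else:
        lag_exact = float(lags[k])  # peak at the edge: no interpolation possible
    return lag_exact * grid_km / dt_days
```

Interpolating the peak matters because one grid cell per time step corresponds to 18 km / 5 days = 3.6 km/day, so a purely discrete lag would quantize the estimated speed in steps of that size. 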

Figure 5 shows that the estimated speeds mainly range from 0 to 100 km/day [3, 
4, 12, 13, 19, 28, 29, 38, 39]. The green solid curve represents the SST pattern 
propagation velocity predicted by the DL model. The brown dashed curve represents 
the velocity estimated from the satellite/satellite SST MA pairs. The two curves are in 
good agreement, and both show very consistent TIW seasonal fluctuations. In the 
TIW season, TIWs control the motion of the SST pattern. Thus, the DL-forecasted 
SST pattern propagation velocity can be regarded as the TIW speed. In contrast, 
the SST pattern is nearly stationary, with no apparent westward motion, in the no- or 
weak-TIW seasons. 

The DL model can also forecast recursively. In this recursive framework, the forecasted 
SST, the present satellite SST, and the previous 12 satellite SSTs are used to forecast 
the SST at the second recursive step; then, the two forecasted SSTs, the current 
satellite SST, and the previous 11 satellite SSTs are used to forecast the SST at 
the third recursive step. In this way, the DL model recursively forecasts the SST at 
the subsequent steps (the fourth, fifth, sixth, etc. recursive steps). Figure 6 shows an 
example of the recursively forecasted SST maps at the three time steps 
following the final time step in Fig. 2. As can be seen from the figure, the DL model 
still works well and forecasts the TIWs’ westward motion in general. 

Fig. 6 Satellite-observed SST a to c and DL model-forecasted SST d to f 
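
The recursive scheme amounts to a sliding window of fourteen maps in which each new forecast replaces the oldest map. A minimal sketch follows; `model` is a hypothetical callable that takes an array of shape (14, rows, cols) and returns the next map, standing in for the trained DL model. 

```python
from collections import deque
import numpy as np

def recursive_forecast(model, history, n_steps, window=14):
    """Forecast n_steps maps ahead by feeding each forecast back as input.

    history: sequence of at least `window` SST maps, oldest first.
    """
    maps = deque(history[-window:], maxlen=window)
    forecasts = []
    for _ in range(n_steps):
        next_map = model(np.stack(maps))
        forecasts.append(next_map)
        maps.append(next_map)  # the oldest map drops out automatically
    return forecasts
```

After fourteen recursive steps, the window contains forecasted maps only, which is the regime in which forecast errors can accumulate. 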


4 Interannual Variation in TIW Westward Propagation 


The daily Niño3.4 index data are also overlaid on Fig. 5 and denoted by the orange 
dotted curve. The data were provided by the KNMI (Royal Netherlands Meteorological 
Institute) Climate Explorer. Fig. 5 shows that the DL-forecasted TIW speed 
values and the Niño3.4 index values are 180 degrees out of phase. There was a major El 
Niño event from 2014 to 2016, and the TIW speeds were almost zero during this time 
because of the weakening of meridional SST gradients. Mooring and Argo float 
measurements from 2000 to 2010 also support this finding: TIW kinetic energy 
and occurrence probability are negatively correlated with the Niño3.4 index [11]. 
The correlation coefficient between the Niño3.4 index values and the speed values 
estimated from satellite/satellite SST MA pairs is -0.38, with a P-value close to zero 
and a 95% confidence interval of (-0.35, -0.41). The corresponding statistics 
for the DL-forecasted speeds are -0.53, with a P-value close to zero and a 95% 
confidence interval of (-0.50, -0.55). 
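
A correlation coefficient with a 95% confidence interval can be computed as in the sketch below. It uses the Fisher z-transformation, which is a common choice but an assumption here, because the text does not state how the interval was obtained. The sketch also treats the samples as independent, whereas daily series are serially correlated and would give a wider interval. 

```python
import math
import numpy as np

def pearson_with_ci(x, y, z_crit=1.959964):
    """Pearson correlation and its confidence interval via the Fisher z-transform.
    z_crit = 1.959964 corresponds to a two-sided 95% interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = float(np.corrcoef(x, y)[0, 1])
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(x.size - 3)
    return r, (math.tanh(z - half_width), math.tanh(z + half_width))
```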


5 Zonally Westward Propagation of TIWs 


Figure 7 gives the zonal TIW westward propagation speeds in 2-degree latitude bands, 
which were estimated from the satellite/satellite SST maps and the satellite/DL-forecasted 
SST maps, respectively. As can be seen from the figure, the estimated speed 
distributions are consistent with each other, and their temporal fluctuations are similar 
during TIW seasons. The fluctuations are also similar to the curves in Fig. 5. Furthermore, 
the equatorial bands have higher speeds than the higher-latitude bands. All these 
results agree with previous findings that TIWs at different 
latitudes are controlled by different dynamic mechanisms, with their speeds 
determined by equatorial wave processes [22, 38]. 

Fig. 7 Zonal TIW westward propagation speeds in 2-degree latitude bands. a Distribution of speeds 
estimated from satellite/satellite SST MA pairs. b Distribution of forecasted speeds estimated from 
satellite/DL-forecasted SST MA pairs. The white blanks denote outliers beyond the range from 
0 to 100 km/day 


6 Accuracy During the Testing Period (2010/01-2019/03) 


The temporal variations of the root mean square error (RMSE) and bias of the DL model 
were calculated for the testing period and are given in Fig. 8. The figure shows 
that the RMSE and bias are generally stable. The RMSE fluctuates 
between 0.15°C and 0.45°C, while the bias fluctuates between -0.15°C and 0.15°C. 
Because of the rapid change of the SST pattern, the RMSE of the DL model is larger 
during the TIW seasons (Fig. 8a). There are approximately 3300 samples at each 
grid point. The RMSE and bias at each grid were calculated, and the resulting 
spatial distributions are given in Fig. 9. The RMSE in the cold 
tongue area is higher than in other areas. This is caused by the large spatial gradient 
and fast temporal variation of the SST in the cold tongue area. Over the study area, the 
global RMSE of all grids and all samples is 0.29°C and the bias is -0.01°C. 
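
The three kinds of error statistics used here (per sample, per grid, and global) differ only in the axes over which the errors are averaged. A minimal sketch, assuming forecast and satellite arrays of shape (samples, rows, cols) and defining the error as forecast minus satellite (the sign convention is an assumption): 

```python
import numpy as np

def error_statistics(forecast, satellite):
    """RMSE and bias per sample (temporal), per grid (spatial), and global."""
    err = forecast - satellite
    return {
        "rmse_per_sample": np.sqrt(np.mean(err**2, axis=(1, 2))),
        "bias_per_sample": np.mean(err, axis=(1, 2)),
        "rmse_per_grid": np.sqrt(np.mean(err**2, axis=0)),
        "bias_per_grid": np.mean(err, axis=0),
        "rmse_global": float(np.sqrt(np.mean(err**2))),
        "bias_global": float(np.mean(err)),
    }
```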

Fig. 8 RMSE a and bias b temporal trends. The RMSE and bias temporal trends were calculated
sample by sample from the forecasting errors at all grids

Fig. 9 RMSE a and bias b spatial distributions. The RMSE and bias spatial distributions were
calculated grid by grid from the forecasting errors of all samples

For the recursive forecasting, the global RMSE and bias of the DL model from 5
days to 150 days after the current time step (i.e., recursive steps 1 to 30) are given in
Fig. 10. The DL model’s accuracy declines as the forecast horizon grows. Note that
there is no satellite SST in the model input after 14 recursive steps. Even so, the
RMSE does not grow rapidly and is still smaller than 0.80°C at the 15th recursive
step. Meanwhile, the magnitude of the DL model’s bias is also smaller than 0.10°C
at the 30th recursive step.
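The recursive procedure can be sketched as a rolling window in which each forecast replaces the oldest input. The sketch below is an illustration under stated assumptions: the input window is taken to hold 14 maps (inferred from the remark that no satellite SST remains after 14 recursive steps), each step advances 5 days, scalars stand in for SST maps, and `toy_model` is a hypothetical placeholder for the trained network.

```python
from collections import deque

WINDOW = 14      # assumed number of SST maps in the model input
STEP_DAYS = 5    # each recursive step advances the forecast by 5 days

def recursive_forecast(model, observed, n_steps):
    """Roll a one-step forecaster forward by feeding forecasts back as inputs."""
    window = deque(observed, maxlen=len(observed))  # oldest entry drops out first
    forecasts = []
    for _ in range(n_steps):
        nxt = model(list(window))
        forecasts.append(nxt)
        window.append(nxt)  # the forecast replaces the oldest input
    return forecasts

def observed_inputs(step, window=WINDOW):
    """Number of satellite maps still in the input at a 1-based recursive step."""
    return max(0, window - (step - 1))

def toy_model(window):
    """Hypothetical stand-in for the trained network: mean of the input window."""
    return sum(window) / len(window)

observed = [25.0 + 0.1 * k for k in range(WINDOW)]   # toy SST values, deg C
forecasts = recursive_forecast(toy_model, observed, n_steps=30)
lead_days = [STEP_DAYS * (k + 1) for k in range(len(forecasts))]
```

Under these assumptions `observed_inputs` reaches zero at the 15th step, which is where the input consists entirely of earlier forecasts.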


7 Conclusions 


In this chapter, a data-driven DL SST forecasting model based on the DNN technique
was built. The DL model accurately forecasted the spatial-temporal variation of the
SST pattern, with an RMSE of 0.29°C, and the forecasted TIW propagation agrees
well with actual satellite observations.

The DL model differs from previous models. It consists of a multi-scale DNN with
four stacked composite layers and a time-independent but site-dependent bias
correction map. With this design, the DL model takes into account both the spatial
dependence of a site-specific forecast on a large surrounding area and the bias
correction of the DNN at different sites. The DL model was tested for nine years
without overlapping with the training period. The results show that the DL model
effectively forecasts the SST variation associated with TIWs. The DL-forecasted TIW
speed is in good agreement with that estimated from the satellite SST maps. Both
speeds present a consistent seasonal cycle and interannual modulation, and the
interannual modulation is negatively correlated with the Niño3.4 index. TIW speeds
are higher at the equator than at other latitudes. The DL model can also forecast
SSTs at future steps in a recursive manner, although the accuracy degrades with
time because the actual satellite SST input is lost.

Fig. 10 The global RMSE and bias of the DL model implemented recursively, as a function of the
number of recursive steps. In the recursive model, the DL-forecasted SST at a future time step is fed
back to the model input to forecast the SST at the next future time step. Recursive steps 1 to 30
correspond to 5 days to 150 days after the current time step. After 14 recursive steps, there
is no satellite SST map at the model input, and all input SST maps are from the model’s forecast.
The global RMSE and bias were calculated from the forecasting errors of all samples at all grids at
each recursive step

The results show the DNN’s great potential for marine forecasting using gridded
data. Compared with numerical forecasting models, DL forecast models are driven
directly by real measurements and avoid the complexity of model parameterizations
and approximations, various physical equations, and a substantial computational
burden. DL models can forecast accurately with prior information on only a few
physical parameters. In our case, only one parameter, the SST, was used. Almost all
of the DL model’s computational cost is spent on the iterative optimization of the
weights. Hardware acceleration technologies, e.g., CUDA, can readily speed up this
learning procedure. Once the DNN has been trained and the bias correction map
obtained, the DL model makes forecasts without iteration and therefore works very
rapidly. In our case, it takes only about one minute to forecast the SST pattern of
the testing period on an ordinary desktop computer. Because the DNN is a
data-driven technology, whether training or




G. Zheng et al. 


using, sufficient data is always the basic requirement. Fortunately, the growing
amount of marine satellite observations in the era of remote sensing big data
provides sufficient data and is well matched to the DNN’s outstanding learning
capability.
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1 Significance of Sea Surface Height Anomaly Prediction 


Sea surface height anomaly (SSHA) is one of the essential parameters for
investigating ocean dynamics and climate change [4, 6, 32] and indicates mesoscale
ocean dynamic features such as currents, tides, ocean fronts, and water masses. In
addition, it is an important parameter for marine disaster emergency response
[10, 30, 37, 39]. Historically, sea-level changes were computed from tide gauge data
[15, 40]. Compared with sparsely distributed tide gauge stations, satellite altimetry
enables continuous sea-level measurements that cover the entire sea area. Altimetry
measures sea-level change relative to the terrestrial reference frame, thus offering
high-precision data for studying sea level [1, 8, 13, 25, 34].


2 Review of SSHA Predicting Methods 


To predict sea-level changes, numerous approaches based on satellite altimetry data 
have been proposed, which can be categorized according to the type of model used 
as physical-based [7, 16, 27] and data-driven models [3, 31]. 

Physical-based models estimate sea-level changes by statistically combining the
related physical and dynamical equations. Reference [7] predicted the annual and
semi-annual global average sea-level changes by employing a hydrological model.
Reference [16] compared the future global and regional sea-level anomalies
caused by greenhouse gas emissions in the 21st century on the basis of
the Hadley Center climate simulations (HadCM2 and HadCM3). Reference [27]
attempted to predict global seasonal sea-level anomalies 7 months ahead by
developing a dynamic atmosphere-ocean coupled model.

Data-driven models build mapping relationships between sea-level anomaly
records by using statistical methods and are comparatively more accurate in the
prediction of sea-level changes. Reference [3] forecasted seasonal sea-level
anomalies in the North Atlantic based on the auto-regressive integrated moving
average. Reference [31] used the least-squares (LS) method to fit a
polynomial-harmonic model and forecasted the global average and gridded
sea-level anomalies of the eastern equatorial Pacific. To assess the sea-level changes
on the mid-Atlantic coast of the United States, [12] developed a technique based
on empirical mode decomposition (EMD). References [33] and [28] studied sea-level
anomalies based on changes in the Earth’s temperature and in ice sheet flow,
respectively, by using semi-empirical methods. Reference [14] proposed a hybrid
model that combines EMD, LS, and singular spectrum analysis to predict long-term
sea-surface anomalies in the South China Sea (SCS). Reference [17] employed
evolutionary support vector regression (SVR) and gene expression programming to
predict sea-level anomalies in the Caspian Sea from previous sea-level records.
They also combined SVR with the empirical orthogonal function (EOF) [18]: SVR was
used to predict sea-level anomalies in the tropical Pacific, while EOF was applied to
extract the main components and thereby reduce the data dimensions.

Deep learning (DL) is a data-driven approach that is well suited to nonlinear
relationships. Recently, DL has been used for forecasting time-series data [9, 19, 22,
36, 42]. The similarity between SSHA pattern prediction and time-series prediction
has prompted researchers to propose a variety of data-driven DL models for
predicting SSHAs. Reference [5] adopted a recurrent neural network (RNN) for
predicting and analyzing sea-level anomalies. RNNs outperform simple regression
models by extracting and fusing the characteristics of the time dimension [11].
Reference [23] proposed a DL model that integrates a long short-term memory
(LSTM) network with an attention mechanism [2] to reliably predict SSHAs.
Reference [38] developed the merged-LSTM model and showed that it is superior to
several advanced machine learning approaches in predicting sea-surface anomalies.

The aforementioned RNN/LSTM forecasting strategies focus on modeling temporal
change, with the state being updated continuously within every LSTM unit over
time. They estimate the SSHA from the past SSHA series either at a single site or
at its immediately neighboring sites, and thus fail to consider the information
hidden in the SSHA series at other related remote positions. However, the sea level
of a region is influenced by both nearby and distant areas. In addition, spatial
deformations and temporal dynamics are equally significant in the prediction of
future SSHA fields.

Moreover, the former approaches can only predict the SSHA at a single grid point
and not the SSHA over the entire region. A model must be trained for each grid
point in the region. Thus, to forecast the values of all grid points in the area, the
training must be repeated many times to obtain a separate set of network
parameters for each grid point. The resulting model storage and retraining time are
not acceptable.

In the following section, we introduce an SSHA prediction method named multi- 
layer fusion recurrent neural network (MLFrnn). MLFrnn can be used in the temporal 
and spatial domains and can accurately predict the SSHA map for the entire region. 


3 Multi-Layer Fusion Recurrent Neural Network 
for SSHA Field Prediction 


We introduce a classical spatiotemporal forecasting architecture in Sect. 3.1 and then
present the network architecture of MLFrnn in Sect. 3.2. Finally, the multilayer fusion
cell, the fundamental building block of MLFrnn, is discussed in Sect. 3.3.


3.1 A Classical Spatiotemporal Forecasting Architecture: 
ConvLSTM 


Spatiotemporal modeling is necessary for the prediction of SSHA fields, a task that
cannot be achieved with the RNN/LSTM forecasting strategies. For extracting
spatiotemporal information from time-series data, [35] developed ConvLSTM, a
variant of LSTM in which three-dimensional (3D) tensors are used for representing
all of the inputs, gates, cell states, and hidden states. ConvLSTM uses convolution
operators to capture spatial characteristics by explicitly encoding the spatial
information into tensors. The key equations of ConvLSTM are as follows:


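$$
\begin{aligned}
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
g_t &= \tanh\left(W_{xg} * X_t + W_{hg} * H_{t-1} + b_g\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot g_t \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
H_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
\tag{1}
$$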


In this architecture, $*$ denotes the convolutional operator, $\odot$ denotes the Hadamard
product, and $\sigma$ refers to a sigmoid activation function with values in [0, 1] that
describes how much of the state information is transmitted. The cell state
conveyance depends on $i_t$, $g_t$, $f_t$, and $o_t$; capturing the state in the memory
helps avoid rapidly vanishing gradients and alleviates the problem of long-term
dependency.
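A single ConvLSTM update following Eq. (1) can be sketched in plain Python. The sketch assumes one channel, zero padding, and cross-correlation in place of convolution (as is common in DL libraries); the kernel values and all function names are hypothetical, and a practical implementation would use multi-channel tensors in a DL framework.

```python
import math

def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < h and 0 <= jj < w:
                        s += k[a][b] * x[ii][jj]
            out[i][j] = s
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate(x, h_prev, w_x, w_h, b, act):
    """act(W_x * X_t + W_h * H_{t-1} + b), evaluated element by element."""
    cx, ch = conv2d_same(x, w_x), conv2d_same(h_prev, w_h)
    return [[act(cx[i][j] + ch[i][j] + b) for j in range(len(x[0]))]
            for i in range(len(x))]

def convlstm_step(x, h_prev, c_prev, p):
    """One ConvLSTM update; p maps a gate name to (W_x, W_h, b)."""
    f = gate(x, h_prev, *p["f"], sigmoid)
    i = gate(x, h_prev, *p["i"], sigmoid)
    g = gate(x, h_prev, *p["g"], math.tanh)
    o = gate(x, h_prev, *p["o"], sigmoid)
    rows, cols = range(len(x)), range(len(x[0]))
    c = [[f[r][q] * c_prev[r][q] + i[r][q] * g[r][q] for q in cols] for r in rows]
    h = [[o[r][q] * math.tanh(c[r][q]) for q in cols] for r in rows]
    return h, c

K = [[0.0, 0.1, 0.0], [0.1, 0.2, 0.1], [0.0, 0.1, 0.0]]  # hypothetical 3x3 kernel
params = {name: (K, K, 0.0) for name in "figo"}
x = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.2, 0.1, 0.0]]
zeros = [[0.0] * 3 for _ in range(3)]
h1, c1 = convlstm_step(x, zeros, zeros, params)
```

Because the cell state `c` is computed only from the same layer's previous cell state, stacking such cells reproduces the limitation discussed next: cell states do not flow between layers.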

By stacking ConvLSTM layers, an encoder-decoder network can be obtained, as shown
in Fig. 1. The ConvLSTM network takes $X_t$ as the input of the first layer and gives
$\hat{X}_{t+1}$ as the prediction result of the n-th layer. In addition, ConvLSTM
transmits hidden states in both vertical and horizontal directions. The update of the
cell state is restricted to each ConvLSTM layer, and the cell state is transmitted only
in the horizontal direction. The cell states of different layers are mutually
independent. Thus, every ConvLSTM layer completely ignores the cell states of the
lower layers. Moreover, the first layer (the lowest layer) does not receive the memory
contents of the deepest layer (the n-th layer) at the previous time step.

Fig. 1 ConvLSTM network


3.2 Architecture of MLFrnn 


The states of different ConvLSTM layers are mutually independent, and the interlayer
connections are not explored sufficiently. To tackle this problem, the MLFrnn was
developed in this study for the prediction of SSHA fields. As MLFrnn’s fundamental
building element, a novel multilayer fusion cell was designed that captures the
spatiotemporal features of the SSHA fields at both nearby and remote positions.
These features are delivered both horizontally and vertically.

Let us denote the SSHA field at time step t by $X_t$. The SSHA field can then be
predicted from the observations of the previous t days, as follows:

$$
\hat{X}_{t+1} = \underset{X_{t+1}}{\arg\max}\; p\left(X_{t+1} \mid X_1, X_2, \ldots, X_t\right) \tag{2}
$$

where $p(\cdot)$ represents a conditional probability and $\hat{X}_{t+1}$ is the forecast SSHA field
at time step t + 1.

Fig. 2 MLFrnn model for forecasting SSHA fields. The information flows are the SLA field ($X$),
the prediction ($\hat{X}$), the relevance state ($R$), and the cell state ($C$) and hidden state ($H$).
The red and purple arrows respectively indicate the directions of the information flows for inputs
and forecasts, whereas the green arrows denote the flow direction of the cell/hidden state
information that is transmitted both transversely and longitudinally. The blue arrows show the flow
direction of the relevance state information, which connects the first and deepest layers

Figure 2 presents the architecture of the MLFrnn model. It extracts the cell state
$C_t^n$ and the hidden state $H_t^n$ (t denotes the time step and n the layer index) for
storing long-term and short-term memories. The cell state is transmitted both
vertically and horizontally rather than only horizontally as in ConvLSTM. To learn
the SSHA variations, the MLFrnn model fuses cell states and hidden states from
different layers.

Additionally, in ConvLSTM, the first and deepest layers are not connected across
consecutive time steps. To overcome this limitation, a supplementary relevance
state $R_t^n$ is added, which stores the spatial features of the SSHA fields and is
updated directly in the transverse direction. In the subsequent time step, the first
layer is fed with the relevance state $R_t^n$ of the deepest layer, which provides richer
spatial features.

The forecast of the SSHA field $\hat{X}_{t+1}$ is obtained by applying a 1 × 1 convolution
to the hidden state $H_t^n$ of the deepest layer.


3.3 Multi-layer Fusion Cell 


A multilayer fusion cell consists of a recurrent forecasting module and a feature
fusion module, which together fuse the cell state, hidden state, and relevance
state to learn the variations of SSHA fields (Fig. 3a). The structure of the recurrent
forecasting module is presented first, and then the feature fusion module is discussed.


3.3.1 Recurrent Forecasting Module 


The recurrent forecasting module can simultaneously encode the relevance state and 
the cell state. As shown in Fig. 3b, the structures in the red box encode the relevance 
state, and the structures in the blue box encode the cell state. Subsequently, the 
recurrent forecasting module extracts the hidden states transferred to the next time 
step cell via the output gate. 

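(1) Encoding relevance states: The time-series data $X_t$ and the relevance state $R_t^{n-1}$
are considered as inputs. The input gate $i'_t$, forgetting gate $f'_t$, and input-modulation
gate $g'_t$ control the update of all the relevance states. The equations for encoding
relevance states are:

$$
\begin{aligned}
f'_t &= \sigma\left(W'_{xf} * X_t + W_{rf} * R_t^{n-1} + b'_f\right) \\
i'_t &= \sigma\left(W'_{xi} * X_t + W_{ri} * R_t^{n-1} + b'_i\right) \\
g'_t &= \tanh\left(W'_{xg} * X_t + W_{rg} * R_t^{n-1} + b'_g\right) \\
R_t^n &= f'_t \odot R_t^{n-1} + i'_t \odot g'_t
\end{aligned}
\tag{3}
$$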


(2) Encoding cell states: The time-series data $X_t$ and the hidden state $H_{t-1}^n$ are
considered as inputs. The input gate $i_t$, forgetting gate $f_t$, and input-modulation gate $g_t$
control the update of all the cell states. The equations for encoding cell states are:

$$
\begin{aligned}
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1}^n + b_f\right) \\
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1}^n + b_i\right) \\
g_t &= \tanh\left(W_{xg} * X_t + W_{hg} * H_{t-1}^n + b_g\right) \\
\hat{C}_t^n &= f_t \odot C_{t-1}^n + i_t \odot g_t
\end{aligned}
\tag{4}
$$

Fig. 3 Multilayer fusion cell. a Structure of multilayer fusion cell: The multilayer fusion cell
consists of a recurrent forecasting module and a feature fusion module. Concentric circles denote
concatenation. b Recurrent forecasting module: The extraction of relevance states is shown in the
red box. The extraction of cell states and hidden states is presented in the blue box


(3) Extracting hidden states: The output gate of the recurrent forecasting module
relies on $\hat{C}_t^n$, $H_{t-1}^n$, and $R_t^n$. The output gate extracts the hidden state from the cell
and relevance states as the output. The equations for extracting hidden states are:

$$
\begin{aligned}
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1}^n + W_{co} * \hat{C}_t^n + W_{ro} * R_t^n + b_o\right) \\
\hat{H}_t^n &= o_t \odot \tanh\left(W_{1\times 1} * \left[\hat{C}_t^n, R_t^n\right]\right)
\end{aligned}
\tag{5}
$$

where $*$ is the convolutional operator, $\odot$ denotes the Hadamard product, and $\sigma$
represents a sigmoid activation function with values in [0, 1] that describes how much
of each piece of state information is transmitted. The recurrent forecasting module
obtains a wide range of receptive fields by applying a series of convolutional
operators, so that the evolution at both adjacent and distant sites of the SSHA
field can be captured [21].


3.3.2 Feature Fusion Module 


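Figure 4 shows the feature fusion module. In this study, $\hat{H}_t^n$ was concatenated with
the hidden state from the former layer $H_t^{n-1}$ as one of the inputs of the feature fusion
module. $\hat{C}_t^n$ was concatenated with the cell state from the former layer $C_t^{n-1}$ and taken
as another input. The feature fusion module can be defined as

$$
\begin{aligned}
f_t &= \sigma\left(W_{cf} * \left[\hat{C}_t^n, C_t^{n-1}\right] + W_{hf} * \left[\hat{H}_t^n, H_t^{n-1}\right] + b_f\right) \\
i_t &= \sigma\left(W_{ci} * \left[\hat{C}_t^n, C_t^{n-1}\right] + W_{hi} * \left[\hat{H}_t^n, H_t^{n-1}\right] + b_i\right) \\
g_t &= \tanh\left(W_{cg} * \left[\hat{C}_t^n, C_t^{n-1}\right] + W_{hg} * \left[\hat{H}_t^n, H_t^{n-1}\right] + b_g\right) \\
o_t &= \sigma\left(W_{co} * \left[\hat{C}_t^n, C_t^{n-1}\right] + W_{ho} * \left[\hat{H}_t^n, H_t^{n-1}\right] + b_o\right) \\
C_t^n &= f_t \odot \operatorname{Conv}\left(\left[\hat{C}_t^n, C_t^{n-1}\right]\right) + i_t \odot g_t \\
H_t^n &= o_t \odot \tanh\left(C_t^n\right)
\end{aligned}
\tag{6}
$$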

The cell and hidden states from different layers are integrated by the feature
fusion module. While the states from the shallow layers carry information about
nearby locations on a local scale, the states from the deep layers carry information
about remote locations on a global scale. For the generation of subsequent SSHA
fields, the local and global information is integrated by the feature fusion module.
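The data flow through one multilayer fusion cell, Eqs. (3) to (6), can be traced with a per-pixel sketch. It rests on strong simplifying assumptions that are not part of the chapter: every kernel is reduced to a single shared scalar weight `w`, all biases are zero, and a concatenation followed by a convolution becomes a weighted sum. The equations write the input as X_t; which tensor plays that role in layers above the first is not specified here, so the sketch simply reuses `x`. All names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fusion_cell_step(x, h_prev, c_prev, r_in, h_below, c_below, w=0.5):
    """Scalar stand-in for one multilayer fusion cell at one pixel.

    x: input; h_prev, c_prev: this layer's states at the previous time step;
    r_in: incoming relevance state; h_below, c_below: current states of the
    layer below. Returns (hidden state, cell state, relevance state).
    """
    # Eq. (3): relevance state, gated by the input and the incoming relevance
    a = w * x + w * r_in
    r = sigmoid(a) * r_in + sigmoid(a) * math.tanh(a)
    # Eq. (4): intermediate cell state from this layer's own history
    b = w * x + w * h_prev
    c_hat = sigmoid(b) * c_prev + sigmoid(b) * math.tanh(b)
    # Eq. (5): output gate and intermediate hidden state
    o = sigmoid(w * x + w * h_prev + w * c_hat + w * r)
    h_hat = o * math.tanh(w * c_hat + w * r)   # weighted sum over [c_hat, r]
    # Eq. (6): feature fusion with the states of the layer below
    cc = w * c_hat + w * c_below               # weighted sum over [c_hat, c_below]
    hh = w * h_hat + w * h_below               # weighted sum over [h_hat, h_below]
    s = cc + hh
    c = sigmoid(s) * cc + sigmoid(s) * math.tanh(s)
    h = sigmoid(s) * math.tanh(c)
    return h, c, r

# Two stacked cells at one time step, all previous states set to zero
h1, c1, r1 = fusion_cell_step(0.4, 0.0, 0.0, 0.0, 0.0, 0.0)
h2, c2, r2 = fusion_cell_step(0.4, 0.0, 0.0, r1, h1, c1)
```

The point of the sketch is the wiring: the second cell receives the relevance, hidden, and cell states of the first, which is the vertical flow that plain stacked ConvLSTM layers lack.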
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Fig. 4 Feature fusion module 


4 Experimental Results and Discussion 


The experimental results of the MLFrnn model are presented. First, the study area, 
dataset, and implementation details are presented. Next, the MLFrnn is compared 
with the currently available approaches. Moreover, the prediction results of the 
MLFrnn in different seasons are revealed. Finally, the impact of the different fusion 
modules and the number of layers on SSHA prediction is discussed. 


4.1 Study Area and Dataset 


As a semi-enclosed basin, the SCS connects the Pacific Ocean with the Indian Ocean.
In addition, it features a complex seafloor topography accompanied by mesoscale 
eddies and frequent storm surges. It is considered as an area where oceanic and 
atmospheric modes exert a strong influence on the sea level [26, 29, 41, 43]. As a 
result, the fluctuating features of SSHA in the SCS are suitable for confirming the 
performance of the proposed network model. A subarea of the SCS was chosen as 
our study area, spanning 4.875°N-19.625°N and 109.875°E-119.625°E (red box in 
Fig. 5). 

The satellite altimeter data were sourced from different missions, including
Envisat, ERS-1/2, GFO, Jason-1/2/3, and T/P. The data were produced by
Archiving, Validation, and Interpretation of Satellite Oceanographic data (AVISO)
and distributed by the Copernicus Marine Environment Monitoring Service
(CMEMS). The daily mean data (0.125°N-25.125°N, 100.125°E-125.125°E)
between January 1, 2001 and May 13, 2019 were used for the experiments, with
a spatial resolution of 1/4° latitude × 1/4° longitude. The training set comprised the
SSHA fields recorded between January 1, 2001 and May 1, 2016, while the test set
consisted of the SSHA fields recorded between May 2, 2016 and May 13, 2019.
Among the 6647 sequences, 5570 belonged to the training set and 1077 belonged to
the test set. Every sequence contained 31 tensors, and the temporal interval between
tensors was 1 day. In this study, the first 10 tensors were regarded as the input,
whereas the subsequent 21 tensors were regarded as the prediction reference. Before
being fed to the model, the data were augmented by horizontal mirroring.
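The sequence counts quoted above follow from the date ranges if the sequences are taken as stride-1 sliding windows of 31 consecutive days within each period. That windowing is an assumption (the text does not spell it out), as is the reading of horizontal mirroring as a left-to-right flip; the helper names are hypothetical.

```python
from datetime import date

SEQ_LEN = 31   # 10 input days followed by 21 target days
N_INPUT = 10

def n_days(start, end):
    """Number of daily fields from start to end, both inclusive."""
    return (end - start).days + 1

def n_sequences(start, end, seq_len=SEQ_LEN):
    """Number of stride-1 windows of seq_len consecutive days in the period."""
    return n_days(start, end) - seq_len + 1

train = n_sequences(date(2001, 1, 1), date(2016, 5, 1))
test = n_sequences(date(2016, 5, 2), date(2019, 5, 13))

def split_sequence(seq):
    """Split one 31-field sequence into model input and prediction reference."""
    return seq[:N_INPUT], seq[N_INPUT:]

def mirror_horizontally(field):
    """Flip a 2D field (list of rows) left to right as a simple augmentation."""
    return [row[::-1] for row in field]

inputs, targets = split_sequence(list(range(SEQ_LEN)))
```

With these assumptions `train` and `test` come out as 5570 and 1077, matching the stated split and summing to 6647.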


4.2 Implementation Detail 


The spatial zone of the monitored SSHA field was considered an H × W grid involving
L measurements. Accordingly, a 3D tensor $X_t$ with dimensions L × W × H was used
to denote the daily SSHA data. From a temporal perspective, the observations across
t time steps form a tensor sequence $X_1, X_2, \ldots, X_t$. Before being fed to the
MLFrnn model, these tensors were normalized to the range [0, 1]. The normalization
puts the data on a common scale so that model training and convergence can be
accelerated. In the following experiments, H and W were set as 100, L was set as 1,
and t was set as 10.
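Normalization to [0, 1] is commonly done by min-max scaling. The chapter does not state which bounds were used, so the sketch below assumes fixed bounds `lo` and `hi` (for example, taken from the training data); the function names and toy values are hypothetical.

```python
def minmax_normalize(field, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    span = hi - lo
    return [[(v - lo) / span for v in row] for row in field]

def denormalize(field, lo, hi):
    """Invert minmax_normalize to recover values in the original units."""
    span = hi - lo
    return [[v * span + lo for v in row] for row in field]

field = [[-0.30, 0.00], [0.15, 0.30]]   # toy SSHA values
lo, hi = -0.30, 0.30                     # assumed bounds
scaled = minmax_normalize(field, lo, hi)
restored = denormalize(scaled, lo, hi)
```

Using the same bounds for training and test data keeps the forecasts and the observations on one scale, so the inverse mapping can be applied to model output.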


Sea Surface Height Anomaly Prediction Based on Artificial Intelligence


At the model training stage, the initial hidden states $H_{t=0}^n$, cell states $C_{t=0}^n$,
and relevance states $R_{t=0}^n$ were all set to zero. Training was terminated after
80,000 iterations, each with a mini-batch size of 8. The initial learning rate was
0.003 and was decreased by a factor of 0.9 every 2500 iterations [24]. The MSE was
adopted as the loss function, and the Adam optimizer [20] was used. The loss
decreased rapidly and ultimately converged to a small value. Repeated training runs
differed only slightly in their outcomes.
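The learning-rate schedule can be written down explicitly. The sketch assumes a staircase decay, in which the rate is multiplied by 0.9 once every 2500 iterations; a smooth exponential decay with the same constants would also fit the description. The function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.003, decay=0.9, every=2500):
    """Staircase decay: multiply by `decay` once per `every` iterations."""
    return base_lr * decay ** (iteration // every)

TOTAL_ITERATIONS = 80_000
BATCH_SIZE = 8

# Learning rate at the start of each 2500-iteration stage
schedule = [learning_rate(it) for it in range(0, TOTAL_ITERATIONS, 2500)]

# Total number of training sequences processed over the whole run
samples_seen = TOTAL_ITERATIONS * BATCH_SIZE
```

Under this reading the run has 32 stages, the final stage uses a rate of roughly 1.1e-4, and 640,000 sequences are processed in total.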

For assessing the performance of the MLFrnn model, the following three metrics 
were adopted: the mean absolute error (MAE), the root mean square error (RMSE), 
and Pearson’s correlation coefficient (r). 
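The three metrics have standard definitions, sketched here over flat lists of forecast and observed values. How the values are pooled over grid points and samples is left to the caller; the toy numbers are hypothetical.

```python
import math

def mae(pred, obs):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def pearson_r(pred, obs):
    """Pearson's correlation coefficient between forecasts and observations."""
    n = len(obs)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return cov / (sp * so)

obs = [0.10, 0.20, 0.30, 0.40]
pred = [0.12, 0.18, 0.33, 0.38]
scores = (mae(pred, obs), rmse(pred, obs), pearson_r(pred, obs))
```

MAE and RMSE are in the units of the data and are better when smaller, whereas r is dimensionless and is better when closer to 1.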


4.3 Experiment Results and Discussion 


The MLFrnn model was studied experimentally to predict the SSHA field 21 days
ahead and to assess its performance against the strong recent baseline ConvLSTM.
The MLFrnn model was also compared with currently available DL approaches,
namely the merged LSTM and the LSTM with attention in the space and time
dimensions (LSTM + STA).

In Table 1, the RMSE values of the MLFrnn and ConvLSTM models with a 
varying number of layers are presented for comparison. MLFrnn outperformed Con- 
vLSTM in terms of predictive capacity, and with an increase in the time step, serious 
performance degradation was noted for ConvLSTM. Moreover, MLFrnn exhibited 
superior SSHA field predictability compared to ConvLSTM owing to the blending 
of spatiotemporal features (both local and global). 


Table 1 RMSE values of SSHA field prediction 1-21 days ahead for the four-layered
MLFrnn (MLFrnn-4), single-layered ConvLSTM (ConvLSTM-1), and four-layered ConvLSTM
(ConvLSTM-4)

Prediction lead (day)   1         2         4         5         6         7
ConvLSTM-1              0.00416   0.00678   0.01350   0.01757   0.02191   0.02643
ConvLSTM-4              0.00353   0.00557   0.01080   0.01389   0.01716   0.02054
MLFrnn-4                0.00324   0.00506   0.00946   0.01200   0.01466   0.01739

Prediction lead (day)   8         9         11        12        13        14
ConvLSTM-1              0.03103   0.03564   0.04474   0.04918   0.05353   0.05779
ConvLSTM-4              0.02396   0.02735   0.03389   0.03698   0.03991   0.04266
MLFrnn-4                0.02014   0.02288   0.02821   0.03078   0.03326   0.03565

Prediction lead (day)   15        16        18        19        20        21
ConvLSTM-1              0.06195   0.06602   0.07389   0.07771   0.08145   0.08511
ConvLSTM-4              0.04523   0.04762   0.05186   0.05372   0.05542   0.05696
MLFrnn-4                0.03795   0.04017   0.04436   0.04634   0.04826   0.05012
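To put the gaps in Table 1 in relative terms, the RMSE reduction of MLFrnn-4 with respect to each ConvLSTM baseline can be computed at a few representative leads. The values below are copied from Table 1; the choice of leads and the helper name are only for illustration.

```python
# RMSE values copied from Table 1 at four representative prediction leads (days)
table1 = {
    1:  {"ConvLSTM-1": 0.00416, "ConvLSTM-4": 0.00353, "MLFrnn-4": 0.00324},
    7:  {"ConvLSTM-1": 0.02643, "ConvLSTM-4": 0.02054, "MLFrnn-4": 0.01739},
    14: {"ConvLSTM-1": 0.05779, "ConvLSTM-4": 0.04266, "MLFrnn-4": 0.03565},
    21: {"ConvLSTM-1": 0.08511, "ConvLSTM-4": 0.05696, "MLFrnn-4": 0.05012},
}

def reduction_percent(baseline, model):
    """Relative RMSE reduction of `model` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - model) / baseline

vs_convlstm4 = {lead: reduction_percent(row["ConvLSTM-4"], row["MLFrnn-4"])
                for lead, row in table1.items()}
vs_convlstm1 = {lead: reduction_percent(row["ConvLSTM-1"], row["MLFrnn-4"])
                for lead, row in table1.items()}
```

At these four leads the reduction relative to ConvLSTM-4 is about 8% at 1 day, about 15% to 16% at 7 and 14 days, and about 12% at 21 days; relative to ConvLSTM-1 it is about 41% at 21 days.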


Y. Zhou et al.


Table 2 SSHA field predictability (RMSE) of four-layered MLFrnn (MLFrnn-4), LSTM + STA,
and merged LSTM (N/A: value not available)

Prediction lead (day)   1         2         3         4         5
Merged LSTM             0.0028    0.0059    0.0088    0.0120    0.0160
LSTM + STA              0.0038    N/A       N/A       N/A       N/A
MLFrnn-4                0.00324   0.00506   0.00714   0.00946   0.01200


Fig. 6 Summertime MLFrnn qualitative findings for SSHA forecasts. The SSHA fields 1-7 days
ahead were estimated from 10 days of SSHA-field observations. a SSHA-field
observations. b SSHA-field forecasts. c Deviations of SSHA-field observations from forecasts


The predictive behavior of the MLFrnn model was compared with that of LSTM
+ STA and merged LSTM (Table 2). The SSHA estimates of these two models were
available only for 1 day ahead and for up to 5 days ahead, respectively. Therefore,
MLFrnn was compared at identical prediction leads. Although MLFrnn was designed
to estimate SSHAs up to 21 days ahead, it also performed well in short-term
prediction: its RMSE was lower than that of LSTM + STA at the 1-day lead and lower
than that of merged LSTM at leads of 2-5 days, while merged LSTM had a slightly
lower RMSE at the 1-day lead (0.0028 versus 0.00324). This is because MLFrnn
overcomes the vector-input limitation of the LSTM and models the spatial and
temporal structure of the SSHA field simultaneously.

To characterize the performance of the MLFrnn model, we examined its predictions
in different seasons. The summertime and wintertime MLFrnn predictions are
displayed in Figs. 6, 7, 8, 9, 10, and 11, where the SSHA observations, MLFrnn
forecasts, and their deviations are presented from top to bottom. The differences
between the observations and the forecasts were small and acceptable. This finding
suggests that MLFrnn predicts the SSHA well across seasons.

Fig. 7 Summertime MLFrnn qualitative findings for SSHA forecasts. The SSHA fields 8-14 days
ahead were estimated from 10 days of SSHA-field observations. a SSHA-field
observations. b SSHA-field forecasts. c Deviations of SSHA-field observations from forecasts

Fig. 8 Summertime MLFrnn qualitative findings for SSHA forecasts. The SSHA fields 15-21 days
ahead were estimated from 10 days of SSHA-field observations. a SSHA-field
observations. b SSHA-field forecasts. c Deviations of SSHA-field observations from forecasts
Fig. 9 Wintertime MLFrnn qualitative findings for SSHA forecasts. The SSHA fields 1-7 days
ahead were estimated from 10 days of SSHA-field observations. a SSHA-field
observations. b SSHA-field forecasts. c Deviations of SSHA-field observations from forecasts

Fig. 10 Wintertime MLFrnn qualitative findings for SSHA forecasts. The SSHA fields 8-14 days
ahead were estimated from 10 days of SSHA-field observations. a SSHA-field
observations. b SSHA-field forecasts. c Deviations of SSHA-field observations from forecasts


Fig. 11 Wintertime MLFrnn qualitative findings for SSHA forecasts (panels T = 15 to T = 21). The
SSHA fields 15-21 days ahead were estimated from 10 days of SSHA-field observations. a SSHA-field
observations. b SSHA-field forecasts. c Deviations of SSHA-field observations from forecasts


4.4 Ablation Study 


To explore the contributions of the feature fusion module and the number of layers, 
an ablation study was conducted experimentally for the following models: 


(1) The two-layered MLFrnn based on the feature fusion module (MLF(F)-2). 

(2) The three-layered MLFrnn based on the feature fusion module (MLF(F)-3). 

(3) The four-layered MLFrnn based on the feature fusion module (MLF(F)-4). 

(4) The four-layered MLFrnn based on a 3 × 3 convolution as the feature fusion module
(MLF(Conv)-4).


The RMSE values of the 21-day forecasts obtained using the various models are
displayed in Fig. 12. MLF(F)-4 was compared with MLF(Conv)-4: forecasting the
SSHA field evolution with the feature fusion module gave higher accuracy. This was
probably because the feature fusion module allows MLFrnn to model long-term
SSHA dependencies better.

A further comparison was made of the SSHA field predictions among the models
with different numbers of layers. MLFrnn-4 outperformed the others. Increasing the
number of layers brings in SSHA information from a broader surrounding area,
which improves the accuracy of the forecasts. The MAE comparisons of the 1-21-day
SSHA-field forecasts obtained using the different models are presented in Fig. 13.
The MAE trends are similar to those of the RMSEs in Fig. 12.
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Fig. 12 Comparison of the RMSE values regarding the future 1-21-day SSHA-field forecasts 
obtained using different fusion modules and number of layers 


The correlation coefficient measures the degree of linear association between the
investigated parameters. Pearson’s correlation coefficient (r) between the SSHA
forecasts and observations over all samples was determined and compared (Fig. 14).
Compared with the remaining three models, MLFrnn-4 with the feature fusion
module exhibited greater r values throughout. Moreover, r decreased more slowly
for MLFrnn-4 than for the other three models. This confirms the positive linear
relationship between the SSHA field forecasted by the feature fusion module-based
MLFrnn-4 and the true value.


5 Conclusion 


Fig. 13 Comparison of the MAE values regarding the future 1-21-day SSHA-field forecasts
obtained using different fusion modules and the number of layers

In this chapter, we first introduced the significance of SSHA prediction and then
presented a review of currently available SSHA prediction methods. Next, an SSHA
forecasting model named MLFrnn was proposed. The main advantages of MLFrnn
can be summarized as follows:

(1) The proposed MLFrnn model can acquire spatiotemporal traits from both nearby 
and distant sites. Existing RNN/LSTM-based studies emphasize temporal mod- 
eling while disregarding spatial information. In contrast, the proposed MLFrnn 
model prominently improves the SSHA predictability by modeling both the tem- 
poral variations and the spatial evolutions of the SSHA fields. 

(2) MLFrnn enables prediction for the entire SSHA map rather than single-site 
forecasts of the SSHA. Prior approaches could achieve SSHA estimation for 
only one grid and required repeated model training for performing predictions 
for the entire zone. In contrast, the MLFrnn model can perform predictions for 
the entire SSHA map accurately. 

(3) In this study, a multilayer fusion cell was developed for MLFrnn to fuse
local and global spatiotemporal characteristics. As a result, the SSHA field
evolution is modeled reliably using the SSHAs from both nearby and
remote locations.


Finally, we selected the SCS as the study area and presented the experimental results
of the MLFrnn model on daily average satellite altimeter SSHA data covering nearly
19 years. The experimental results demonstrated that the MLFrnn model is effective
and generally performs better than the compared DL networks in predicting
the SSHA field.

Fig. 14 Comparison of the Pearson’s correlation coefficient values r regarding the future 1-21-day
SSHA-field forecasts obtained using different fusion modules and the number of layers
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1 Introduction 


Internal solitary waves (ISWs) are a ubiquitous phenomenon in the world’s oceans,
particularly in continental and marginal waters [4, 10, 13, 28]. ISWs are generated
at the mixed layer beneath the ocean surface and show less obvious manifestations
on the ocean surface. ISWs can travel hundreds of kilometers while maintaining their
waveform and amplitude, owing to the balance between nonlinear and dispersion
effects. The restoring force of ISWs is the reduced gravity, which permits the
generation of large-amplitude ISWs. ISWs with amplitudes over 240 m have been
observed in the South China Sea (SCS). The length of wave crest (LWC) of ISWs can
also extend to several hundreds of kilometers. ISWs are found to travel across the
whole northern SCS, the Andaman Sea, and the Sulu-Celebes Sea within a few days.

Ocean habitats, offshore engineering, naval operations, ocean mixing, and sediment
resuspension can all be affected by the propagation and breaking of ISWs. While the
wave crest of an ISW can extend for hundreds of kilometers, the ISW scale across
the wave crest ranges only from several hundred to several thousand meters. The ISW
propagation speed ranges between 2.0 and 3.0 m/s in the deep ocean and between 1.0
and 2.0 m/s on the continental shelf. This narrow cross-crest scale means that an ISW
passes a fixed location within a few tens of minutes. Because of their fast
propagation and large amplitudes, ISWs are extremely dangerous to submarines and
underwater vehicles. The propagation of ISWs is accompanied by ISW-induced
currents, and the resulting severe shear forces endanger the safety of offshore
equipment, such as oil rigs. The propagation of large-amplitude ISWs also
induces strong vertical mixing in the ocean and affects the distribution of suspended
matter and sediments [8]. The importance of ISWs makes the forecast of ISW
propagation a meaningful but challenging task.
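
As a rough worked example of the time scales above, the passage time at a fixed
point is the cross-crest width divided by the propagation speed. The sketch below
is only an illustration: the function name is hypothetical, and the width and speed
pairs are arbitrary values picked from the ranges quoted in the text, not measurements.

```python
def passage_time_minutes(width_m: float, speed_m_s: float) -> float:
    """Minutes needed for an ISW of the given cross-crest width to pass a fixed point."""
    if speed_m_s <= 0:
        raise ValueError("speed_m_s must be positive")
    return width_m / speed_m_s / 60.0


# Illustrative (width, speed) pairs picked from the ranges quoted in the text.
examples = [(500.0, 1.0), (2000.0, 2.0), (3000.0, 2.5)]
for width_m, speed_m_s in examples:
    minutes = passage_time_minutes(width_m, speed_m_s)
    print(f"{width_m:.0f} m wide at {speed_m_s} m/s: about {minutes:.1f} min")
```

All three cases fall between roughly 8 and 20 min, in line with the "few tens of
minutes" stated above.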

Their global occurrence, random generation, fast propagation, and significant impact
on the ocean have made ISWs a hot topic for decades. Various methods have been
used to study ISWs, such as numerical models, in-situ observations, and satellite
observations. Remote sensing techniques have developed rapidly and show significant
advantages in ISW studies [10, 11]. To describe ISW propagation, various
approaches have been developed, such as the Korteweg-de Vries (KdV) equation,
the Benjamin-Ono (BO) equation, and numerical models. The KdV equation for
the propagation of ISWs is given by


$$\eta_t + c_0 \eta_x + \alpha \eta \eta_x + \gamma \eta_{xxx} = 0 \qquad (1)$$


Here $\eta$ is the amplitude of the solitary wave and $c_0$ is the linear phase speed. When
the nonlinear term (coefficient $\alpha$) is balanced by the dispersion term (coefficient
$\gamma$), one gets the solitary wave with an analytical form


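$$\eta(x, t) = \eta_0 \, \mathrm{sech}^2\left[(x - ct)/L\right] \qquad (2)$$

$$c = c_0 \left(1 + \frac{\eta_0}{2h}\right) \qquad (3)$$

$$L = \sqrt{\frac{12\gamma}{\alpha \eta_0}} \qquad (4)$$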


where $\eta_0$ is the maximum amplitude, $L$ is the characteristic length, $h$ is the water
depth, and $g$ is the acceleration due to gravity. The KdV equation is commonly used
to describe the propagation speed of ISWs, but it requires the ISW amplitude as
prior information. Different theories have been developed for ISW propagation
in different ocean areas, each with its advantages and disadvantages. Here we
introduce a new data-driven model to forecast ISW propagation. The forecast model
was trained using a large dataset collected from multi-source remote sensing imagery.
The model performs better than the traditional equations and is
more robust to errors in the model inputs.
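
To make the analytical form in Eq. (2) concrete, the following minimal sketch
evaluates the sech-squared profile at a few positions around the wave crest. The
amplitude, speed, and characteristic length are assumed illustrative values (they are
passed in directly rather than derived from the KdV coefficients), and the function
name is hypothetical.

```python
import math


def sech2_profile(x_m, t_s, eta0_m, c_m_s, length_m):
    """Evaluate eta(x, t) = eta0 * sech^2((x - c*t) / L), the form of Eq. (2)."""
    arg = (x_m - c_m_s * t_s) / length_m
    return eta0_m / math.cosh(arg) ** 2


# Assumed illustrative values: a 60 m depression wave (negative displacement).
ETA0_M, C_M_S, LENGTH_M = -60.0, 2.5, 1000.0
t_s = 3600.0
crest_x_m = C_M_S * t_s  # after time t the crest has moved a distance c*t

for offset_m in (-2000.0, -1000.0, 0.0, 1000.0, 2000.0):
    eta = sech2_profile(crest_x_m + offset_m, t_s, ETA0_M, C_M_S, LENGTH_M)
    print(f"x - ct = {offset_m:7.0f} m  ->  eta = {eta:7.2f} m")
```

The displacement equals the maximum amplitude at the crest and decays
symmetrically on both sides, which is why the waveform is preserved while it
translates at speed c.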

In the remainder of this chapter, we first briefly review the achievements of ISW
studies using satellite observations. Then, machine learning techniques applied to
ISW studies are introduced, and the establishment of the ISW forecast model
is presented. Model discussions and future work are given in the
last part of this chapter.


2 Satellite Observations of ISWs


Space-borne synthetic aperture radar (SAR) and optical satellite images are commonly
used in ISW studies. SAR is an active microwave radar with all-weather,
all-time, long-distance, and high-resolution detection advantages. The imaging
principle of ISWs on SAR images is the Bragg backscattering mechanism [1]. Currents
induced by an ISW (usually referring to the first-mode depression ISW) modulate
micro-scale waves on the sea surface, making convergent and divergent regions appear
in the front and back of the wave, respectively. The sea surface roughness increases in
the convergent region, where the Bragg backscatter signal is enhanced and appears
as a bright stripe on SAR images. The sea surface roughness decreases in the divergent
region, where the Bragg backscatter signal becomes weaker and appears dark on
the SAR image. Figure 1a and b show that ISWs manifest as bright-dark stripes on
SAR images. Although the spatial resolution of SAR images is relatively high, their
temporal resolution is relatively low. The limited swath of SAR images also imposes
restrictions on ISW studies. SAR images are generally used to study the mechanisms
of ISWs, such as inversion of amplitude [26], propagation speed [15], and energy
analysis [17].

The imaging mechanism of ISWs on optical images is quasi-specular reflection.
Optical remote sensing uses sunlight as the light source and receives ocean
information from the sunlight reflected by the sea surface. The characteristics of ISWs
in optical images are more complicated than in SAR images. Even for the same
depression ISW, the signature differs with location on the optical image: it
may appear as bright-dark or dark-bright stripes depending on whether it lies in a
sun-glint or non-sun-glint area. Figure 1c and d show satellite observations of ISWs in
the Sulu-Celebes Sea. Multiple ISW packets can be observed propagating in different
directions with long wave crests.

Optical remote sensing images have the advantages of high temporal resolution and
wide swath. Taking Moderate-Resolution Imaging Spectroradiometer (MODIS) images as
an example, the swath can reach 2330 km, and the same ocean area can be observed twice
in one day. In addition, the spatial resolution of optical satellites launched in recent
years has also been greatly improved. For example, the highest spatial resolution of
GF-1 remote sensing images of China’s high-resolution series can reach 2 m, and the
spatial resolution of GF-2 images is finer than 1 m. However, optical
satellite images are heavily affected by weather conditions, such as clouds and
rain, which limit their observation capability. Benefiting from the wide swath
and high temporal resolution of optical satellite images, the temporal and spatial
distribution characteristics of ISWs have been studied in different ocean areas.

SAR and optical satellite images provide rich data sources for ISW research.
The generation mechanism, distribution, and propagation of ISWs have
been reported. Because ISWs affect ocean environments, engineering, mixing, and
military activities, forecasting them is of practical value. ISW forecasts are
mainly conducted using empirical or numerical models [30]. In the SCS, ISWs have
been predicted based on the westward-propagating barotropic tide in the Luzon Strait,
and ISW positions can be predicted according to the distance to the Luzon Strait. Based
on the relationship between ISW generation and the tide information, [6] forecasted
the possible ISW occurrence in the northern Andaman Sea. ISWs are
generated at multiple sources and have complex patterns, as revealed in Fig. 1, which
makes the forecast more difficult. The complexity of multiple sources and the
merging of wave crests during propagation make it hard to apply
the empirical method. Numerical models need large computational resources and
are difficult to set up for ISWs propagating in coastal areas. To overcome these
difficulties, we propose a machine learning model to forecast ISW propagation in
different oceans with different ISW characteristics.

Fig. 1 Satellite detection of ISWs in the Andaman Sea a, South China Sea b, Sulu Sea c, and
Celebes Sea d. a Composite map of Sentinel-1 images acquired on 10 March 2019; b ENVISAT
ASAR image acquired on 5 May 2004; c MODIS image acquired on 14 March 2020; d MODIS
image acquired on 28 March 2020


3 Machine-Learning-Based ISW Forecast Model 


Machine learning techniques are fast evolving and have already demonstrated 
tremendous promise in oceanographic research [5, 23, 24]. Liu et al. [16] explored 
extracting coastal inundation mapping information from SAR imagery by applying 
deep learning techniques. Machine learning also allows for the creation of connec- 
tions between multi-dimensional data. Pan et al. [19] employed textural information 
taken from optical satellite images and ocean environmental factors to determine the 
amplitude of ISWs using the back-propagation (BP) algorithm. Li et al. [12] used a 
U-net-based method to obtain ISW wave crest from satellite images and conducted 
a thorough evaluation of the use of machine learning approaches in satellite image 
information mining. Machine-learning approaches have previously been proven to 
offer benefits in maritime applications due to their high nonlinear mapping ability 
and multi-dimensional data processing. 

The propagation of ISWs is influenced by ISW features as well as by ocean
environmental factors, such as topography and seasonal variations. Machine learning
techniques can handle multi-dimensional influencing factors whose relationship
is not explicitly defined, making them a strong choice for constructing an ISW
propagation model.


3.1 Model Establishment 


We use a fully connected neural (FCN) network to build relationships between
ISW propagation and its influencing factors [25, 29]. Figure 2 depicts a sketch map
of the FCN network. We use the error back-propagation (BP) technique to train the
FCN network. The model output is calculated by the forward calculation procedure,
and the errors between the model output and the desired output are propagated
backward to adjust the model, so that the weights of the neural network are
automatically updated based on these errors. The built dataset is divided
into a training dataset and a validation dataset. The training dataset is employed for
model training, while the validation dataset monitors the training procedure. The
model is initialized randomly and trained for several epochs. Training stops
once the model achieves its best validation results, and the model is then further
evaluated on a test dataset.

Fig. 2 Sketch map of an FCN network. The different thicknesses of the lines connecting different
neurons indicate different weights
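
The stopping rule described above can be sketched as follows. This is a minimal
illustration, not the authors' training code: the function name, the `patience`
value, and the validation-loss sequence are all assumptions made for the example.
Training is considered finished once the validation loss has failed to improve for
`patience` consecutive epochs, and the weights from the best epoch would be kept.

```python
def early_stopping_epoch(val_losses, patience=6):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once the validation loss has not improved for `patience`
    consecutive epochs; the weights from `best_epoch` would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Patience was never exhausted: training ran through every epoch.
    return best_epoch, len(val_losses)


# Hypothetical validation losses: they improve until epoch 4, then rise.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
print(early_stopping_epoch(losses, patience=3))
```

With these made-up losses the best epoch is 4 and training halts at epoch 7, after
three epochs without improvement.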

The forecast model’s input parameters consist of ISW property-related inputs,
i.e., the peak-to-peak (PP) distance and the LWC [31], and ocean environment-related
inputs, i.e., longitude, latitude, mixed layer depth, density difference, and
water depth. The LWC and PP distance of an ISW can be measured from satellite images.
The water depth can be interpolated from the ETOPO1 dataset. Ocean stratification
can be estimated from the World Ocean Atlas (WOA) 2018 dataset, which provides
temperature and salinity. The depth of the buoyancy frequency peak is taken as
the depth of the mixed layer.
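
As an illustration of the last step, the sketch below computes the squared buoyancy
frequency from a density profile by finite differences and returns the depth of its
peak. It assumes the density has already been computed from temperature and salinity;
the profile used here is synthetic (not WOA data), the reference density is an assumed
constant, and the function names are hypothetical.

```python
import math

G = 9.81        # gravitational acceleration, m/s^2
RHO0 = 1025.0   # reference seawater density, kg/m^3 (assumed)


def buoyancy_frequency_squared(depths_m, densities_kg_m3):
    """Finite-difference N^2 = (g / rho0) * d(rho)/d(depth) at layer midpoints."""
    mid_depths, n2 = [], []
    for k in range(len(depths_m) - 1):
        dz = depths_m[k + 1] - depths_m[k]
        drho = densities_kg_m3[k + 1] - densities_kg_m3[k]
        mid_depths.append(0.5 * (depths_m[k] + depths_m[k + 1]))
        n2.append(G / RHO0 * drho / dz)
    return mid_depths, n2


def depth_of_n2_peak(depths_m, densities_kg_m3):
    """Depth (m) at which N^2 is largest, used here as the mixed layer depth."""
    mid_depths, n2 = buoyancy_frequency_squared(depths_m, densities_kg_m3)
    peak = max(range(len(n2)), key=lambda k: n2[k])
    return mid_depths[peak]


# Synthetic profile (not WOA data): a smooth pycnocline centred at 95 m.
depths = [float(z) for z in range(0, 501, 10)]
densities = [1024.0 + 1.5 * (1.0 + math.tanh((z - 95.0) / 30.0)) for z in depths]
print(depth_of_n2_peak(depths, densities))
```

Depth is taken positive downward, so a density that increases with depth gives a
positive N squared; the peak falls in the layer where density changes fastest.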

The model has two modules: the propagation speed (PS) module and the propagation
direction (PD) module. Seven input parameters are included, and the outputs are the
propagation speed and direction. The initial propagation direction is an extra input
parameter in the PD module that is used to resolve cases with cross-propagating ISWs.
Figure 3 depicts the framework of the ISW forecast model. The location, ocean
parameters, and ISW characteristics of an initial ISW position identified in satellite
images are gathered and used as model inputs. Using the model’s predicted
ISW propagation speed and direction, we can run the model for several time
steps to obtain the expected ISW position at each time step. The Levenberg-Marquardt
algorithm-based training function ‘trainlm’ was chosen for its quick convergence
rate [22]. The hyperbolic tangent sigmoid transfer function (tansig) is used to
activate the hidden layers, whereas the linear transfer function (purelin) is used
for the output layer. We used an early stopping strategy to combat over-fitting [20].

Fig. 3 Model structure for the ISW forecast model
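
The iterative use of the predicted speed and direction can be sketched as below.
This is a simplified illustration under stated assumptions: a single crest point is
advanced with a local flat-Earth approximation, the direction is measured clockwise
from north (a convention chosen for the example), and `predict` is a hypothetical
stand-in for the trained PS and PD modules, here replaced by a constant.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def advance_point(lon_deg, lat_deg, speed_m_s, direction_deg, dt_hours):
    """Move a point for dt_hours; direction is clockwise from north (assumed)."""
    distance_m = speed_m_s * dt_hours * 3600.0
    theta = math.radians(direction_deg)
    dlat = distance_m * math.cos(theta) / EARTH_RADIUS_M
    dlon = distance_m * math.sin(theta) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat_deg))
    )
    return lon_deg + math.degrees(dlon), lat_deg + math.degrees(dlat)


def forecast_track(lon_deg, lat_deg, predict, dt_hours, n_steps):
    """Step a crest point forward, querying predict(lon, lat) at every step."""
    track = [(lon_deg, lat_deg)]
    for _ in range(n_steps):
        speed_m_s, direction_deg = predict(lon_deg, lat_deg)
        lon_deg, lat_deg = advance_point(
            lon_deg, lat_deg, speed_m_s, direction_deg, dt_hours
        )
        track.append((lon_deg, lat_deg))
    return track


def constant_predictor(lon_deg, lat_deg):
    """Placeholder for the trained modules: 2 m/s due east everywhere."""
    return 2.0, 90.0


# Two steps of 6.21 h cover one semi-diurnal tidal cycle (12.42 h).
for lon, lat in forecast_track(95.0, 8.0, constant_predictor, 6.21, 2):
    print(f"lon = {lon:.3f}, lat = {lat:.3f}")
```

In a full implementation every point of the wave crest would be advanced, and the
location-dependent inputs (water depth, stratification, LWC) would be refreshed
before each call to the predictor.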


3.2 Model Training 


Both optical and SAR images can be used to extract the training data. The MODIS
sensors are onboard the National Aeronautics and Space Administration (NASA)
satellites Terra and Aqua. The MODIS image has a swath of 2330 km and a highest
spatial resolution of 250 m. The Ocean and Land Color Instrument (OLCI), on board
Sentinel-3, has five camera modules. The OLCI has a swath of 1440 km and a best
spatial resolution of 300 m. We use 123 MODIS images and 33 OLCI images in the
Andaman Sea and 149 MODIS images and 8 VIIRS images in the Sulu-Celebes Sea
to build the dataset.

On satellite images, ISWs appear as bright-dark bands. ISW locations can be saved
as a GIS-formatted file, from which the spatial positions are extracted. The LWC can
be obtained from the ISW labels, and the PP distance is the distance between the
positive and negative peaks of the ISW profile [31]. The propagation direction of the
input ISW wave crest is introduced to handle cross-propagating ISWs. The detailed
procedure for building the training dataset is shown in Fig. 4.

Fig. 4 Procedure to build the training dataset from satellite images. a A subset of the MODIS image
showing three ISW packets; the PP distance can be measured from the satellite image. The inset map
shows the extracted profile of the ISW and how the PP distance is measured; b extracted locations of
ISWs, how the LWC is calculated from the extracted wave crests, and how the initial propagation
direction is obtained; c subset of the extracted dataset; the brown shaded area shows model input
parameters and the green shaded area indicates model output parameters. Panel annotations: ISW
locations are extracted using ENVI or ArcGIS and saved in Shapefile format, which contains
longitude and latitude information; the length of wave crest (LWC) is calculated from the
extracted wave crests

We estimate the phase speed of ISWs from the difference in ISW locations
and the corresponding time difference. If an ISW is detected on two quasi-synchronous
images, its propagation speed is obtained from the location difference and the
difference in image acquisition times. For two successive ISWs in the same satellite
image, we assume the time difference equals the period of the semi-diurnal tide [7, 9].
In the Andaman Sea, 1189 samples were extracted, while in the Sulu-Celebes Sea,
1546 samples were extracted. These samples were split 80%/20% into training and
independent test datasets.
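
The speed estimate and the dataset split can be sketched as follows. This is an
illustrative outline only: the great-circle (haversine) distance is one reasonable
way to measure the separation of two crest positions, the random shuffling and seed
are assumptions, and all function names are hypothetical.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0
SEMI_DIURNAL_PERIOD_H = 12.42


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def phase_speed_m_s(point1, point2, dt_hours=SEMI_DIURNAL_PERIOD_H):
    """Mean speed between two (lon, lat) crest positions separated by dt_hours."""
    return haversine_m(*point1, *point2) / (dt_hours * 3600.0)


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle and split samples into training and independent test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Two made-up crest positions one degree of longitude apart on the equator.
print(f"{phase_speed_m_s((95.0, 0.0), (96.0, 0.0)):.2f} m/s")

# Split sizes for the 1189 Andaman Sea samples (indices stand in for samples).
train, test = split_samples(range(1189))
print(len(train), len(test))
```

For two successive ISWs in one image the default time difference is the
semi-diurnal period; for quasi-synchronous image pairs, `dt_hours` would instead be
the difference between the acquisition times.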

Figure 5 shows the results of the PS and PD modules for ISWs in the Andaman Sea.
The Sulu-Celebes Sea forecast model yields similar results. The root-mean-square
error (RMSE) of the training (test) dataset for the PS module is 0.19 m/s (0.20 m/s),
and the correlation coefficient (CC) is 0.90 (0.88). The results demonstrate that the
loss decreases as training proceeds, and the PS module’s mean square error
(MSE) reaches its optimal validation performance at epoch 37. The PD module has
an RMSE of 10°, and the CCs are over 0.99. The gradient was steadily reduced, and
the PD module obtained its optimal performance at epoch 16. After epoch 16, the MSE
no longer decreases and the validation error rises, so the model stops training.

Fig. 5 Loss of the propagation velocity module a and the propagation direction module b for the ISW
forecast model

ISWs in different ocean regions have different features. In the Andaman Sea, we
observe cross-propagating ISWs, while in the Sulu-Celebes Sea, we observe ISWs
propagating in opposite directions. We trained the model 30 times with and without
the initial primary propagation direction as an input to see how it affected the
predicted outcomes. Models with the initial primary propagation direction as an input
performed better, with a reduced RMSE and more stable model performance, as
illustrated in Fig. 6. The models without the initial primary propagation direction
had large deviations and lower correlation coefficients, indicating lower model
generalizability. For cases with cross-propagating ISW patterns, it is necessary to
include the propagation direction of the ISW wave crest in the model inputs.


3.3 Model Validation


The forecast model was validated for ISWs in the Andaman Sea and the Sulu-Celebes
Sea. ISWs generated by successive semi-diurnal tides in the Andaman Sea are
depicted in Fig. 7. On the MODIS image, three ISW wave packets propagating
eastward can be seen. IW1 and IW2 have LWCs of 146.41 and 242.26 km and PP
distances of 1391.48 and 731.54 m, respectively. The IW1 (IW2) parameters are used as
model inputs, while IW2 (IW3) serves as model validation. In Fig. 7b and c, the time
step is 6.21 h, and ISW positions after one time step are depicted with dashed lines.
The model results (satellite observations) after one semi-diurnal tidal cycle are
indicated by the black (red) lines in Fig. 7b and c. The model-predicted results are in
good agreement with the satellite data.

Fig. 6 Model tests with (blue lines) or without (orange lines) the propagation direction of ISW
wave crests


We used three metrics, namely the root-mean-square error (RMSE), the Fréchet
distance (FD), and the CC, to quantitatively examine the effectiveness of the constructed
ISW forecast model. The FD is a strict evaluation metric that considers the placement
and order of the points of the ISW wave crest. IW1 and IW2 have RMSEs of 6.10 and
2.50 km, FDs of 18.28 and 9.06 km, and CCs of 0.96 and 0.89, respectively.
In the Andaman Sea, we examined eight cases, covering distinct locations
with different ISW features. Table 1 shows the statistical results. The average CC
value is 0.95, and the average FD is 11.46 km, showing that the model-predicted ISW
positions and satellite observations have a good degree of agreement.
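
To illustrate two of these metrics, the sketch below computes a point-wise RMSE and
the discrete Fréchet distance between a predicted and an observed wave crest. The
chapter does not state how the crests are sampled, so the sketch assumes both crests
are given as ordered points in planar kilometer coordinates, with equal numbers of
paired points for the RMSE; the coordinates are made up, and the dynamic-programming
form of the discrete Fréchet distance is a standard choice rather than necessarily
the one used by the authors.

```python
import math


def pointwise_rmse(curve_a, curve_b):
    """RMSE of the distances between paired points of two equally sampled curves."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must have the same number of points")
    squared = [math.dist(p, q) ** 2 for p, q in zip(curve_a, curve_b)]
    return math.sqrt(sum(squared) / len(squared))


def discrete_frechet(curve_a, curve_b):
    """Discrete Frechet distance between two ordered point sequences."""
    n, m = len(curve_a), len(curve_b)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(curve_a[i], curve_b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[-1][-1]


# Made-up crests in planar km coordinates (x across-crest, y along-crest).
observed = [(0.0, 0.0), (0.0, 10.0), (0.0, 20.0)]
predicted = [(3.0, 0.0), (4.0, 10.0), (3.0, 20.0)]

print(f"RMSE = {pointwise_rmse(predicted, observed):.2f} km")
print(f"FD   = {discrete_frechet(predicted, observed):.2f} km")
```

Because the Fréchet distance is governed by the worst separation along an
order-preserving traversal of both curves, it is never smaller than the best-matched
point separation at the ends of the crest and is typically larger than the RMSE, as
in Table 1.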

The results of the model validation for ISWs propagating in the Sulu-Celebes Sea
are shown in Fig. 8. Three ISW wave packets have been detected moving northward
(southward) in the Sulu (Celebes) Sea. For these three wave packets, the leading ISWs
are called IW1, IW2, and IW3. The model input is the wave crest IW1 (IW2), and the
model validation is IW2 (IW3). The model-predicted results are shown as solid black
lines in Fig. 8. The model outputs and satellite data are generally in agreement.
Table 2 shows the statistical results of the nine validation
cases that we gathered. The average RMSE is 12.92 km, the average FD is 18.73 km, and
the average CC is 0.98, which indicates that the model performs well.

ISWs are generated consecutively by semi-diurnal tidal cycles, so ISWs from more than
one tidal cycle can be observed in satellite images. We also tested the forecast model
on these ISWs. The model runs iteratively to estimate ISW positions: the
model-predicted ISW locations serve as the input for the estimation of the next tidal
cycle. The time step is 6.21 h, and three MODIS images were used to assess model
performance. The predicted ISW positions are displayed as solid black lines. The ISW
positions estimated by the forecast model were consistent with satellite observations,
as shown in Fig. 9a and b. The deviations between satellite observations and model
estimations are more pronounced in Fig. 9c. Because of the complex terrain in the
northern Andaman Sea, significant variations in the PP distance of ISWs may
affect the model estimations.

Fig. 7 MODIS images acquired on 9 May 2017. The ISW locations extracted from satellite images
are represented with red lines. The solid (dashed) black lines represent forecast positions after
12.42 h (6.21 h)

Table 1 Statistical results of the forecast model in the Andaman Sea

           RMSE (km)   Fréchet distance (km)   Correlation coefficient
Case 1     2.69        13.90                   0.95
Case 2     2.99        10.66                   0.97
Case 3     6.10        18.28                   0.96
Case 4     2.50         9.06                   0.89
Case 5     3.15        10.43                   0.93
Case 6     2.37         6.42                   0.93
Case 7     4.12        12.66                   0.97
Case 8     1.79        10.28                   0.99
Average    3.21        11.46                   0.95

The forecast model estimates were tested after two semi-diurnal tidal cycles in
the Sulu-Celebes Sea using a MODIS image collected on 29 October 2019 with
distinct ISW signals. Figure 10a depicts the results. The ISW positions in the
Sulu-Celebes Sea labeled IW1 are used as the model input. In Fig. 10b, the ISW
prediction is displayed after three semi-diurnal tidal cycles. After two or three
semi-diurnal tidal cycles, the model estimations coincide well with the satellite
measurements.


4 Influence Factors on the ISW Forecast Model 


Although the model has been validated and shows high accuracy as described above,
several factors affect its results. When the estimated ISW positions of
the first semi-diurnal tidal cycle are used as the model input for the following
forecast, errors may be introduced. From the predicted ISW locations, the position and
ocean environment inputs are updated and the LWC is recomputed, whereas the ISW PP
distance remains constant in the forecast model’s subsequent iterative runs. The
deviations in the predicted outcomes accumulate, resulting in larger discrepancies
in the subsequent iterative predictions. The influences of the time step, input
parameter errors, and seasonal variations, as well as a comparison with the KdV
equation, are discussed in this section.

Because ISWs are frequently generated by semi-diurnal tidal cycles, the time
step is set to 12.42 h by default. Here we examine how changing the time step
affects the model outcomes. Figure 11 depicts the effect of running the model
with various time steps. We compared the model forecast results with satellite
observations using time steps from one quarter of a semi-diurnal tidal cycle to one
full cycle. As the ISWs propagate from IW1 to IW2, the water depth changes from about
2,000 m to around 1,000 m. The results with a time step of 12.42 h are poorer than
those with smaller time steps. The model results were nearly the same for time steps
of 3.11 and 4.14 h, and the difference from the 6.21 h time step was similarly
small. We therefore infer that when ISWs cross steep isobaths, where the
terrain changes dramatically, a smaller time step may improve the prediction. If
the terrain changes gently, a larger time step may be used.

Fig. 8 Forecast model validation cases in the Sulu-Celebes Sea. MODIS images acquired on 18
May 2015 and 29 October 2019 (upper left and right) and corresponding forecast results (lower
left and right)

Table 2 Statistical results for validation cases in the Sulu-Celebes Sea

              Date            IWs   RMSE (km)   Fréchet distance (km)   Correlation coefficient
Sulu Sea      18 May 2015     IW1   14.98       16.27                   1.00
                              IW2    4.72        9.02                   0.99
              04 Aug. 2016    IW1    6.66       11.45                   1.00
              29 Oct. 2019    IW1   14.57       25.77                   0.97
                              IW2    5.56        6.15                   1.00
Celebes Sea   03 Mar. 2014    IW1   17.06       30.42                   0.96
              24 Mar. 2015    IW1   17.05       23.24                   0.93
              29 Oct. 2019    IW1   16.10       27.56                   1.00
                              IW2   19.54       18.73                   1.00
Average                             12.92       18.73                   0.98

Fig. 9 Model results for ISW propagation after two semi-diurnal tidal cycles

The model has eight input parameters, which were taken from satellite
images or publicly accessible datasets. Errors in the input parameters will
influence the forecast model’s outcomes. When the model runs iteratively, all input
parameters are updated except the PP distance and the initial propagation direction:
the PP distance and propagation direction of the initial ISW are reused as the
corresponding inputs in all subsequent predictions. When ISWs are not clearly visible
owing to unsatisfactory imaging conditions, the measured PP distance may contain
errors. The effects of errors in the PP distance and the ISW propagation direction on
the model estimations were studied at four locations in the Andaman Sea.
Figure 12 depicts the results.

An error of ±10° in the initial propagation direction was considered to analyze its
impact on the model predictions. The time step is 6.21 h, and the results are presented
in Fig. 12a and b. The predicted ISW positions were close to each other and to the
satellite data, which indicates that the model tolerates errors in the initial
propagation direction.

Fig. 10 Iterative forecast results for ISWs propagating after a two and b three semi-diurnal tidal
cycles. ISWs are extracted from MODIS images acquired on a 29 October 2019 and b 25 February
2015. Black lines: model forecast results. Red lines: satellite observations of the ISW locations.
Dashed lines: ISW locations every semi-diurnal tidal cycle

Given that a one-pixel inaccuracy results in an error of ±300 m in the PP
distance, we compared the results obtained with different PP distance inputs. In the
Andaman Sea, two locations, one with significant and one with modest water depth
variations, were examined. The results reveal that, despite the errors in the ISW PP
distance, the predicted ISW positions were close to the satellite measurements.

The results demonstrate that the proposed forecast model is tolerant of
inaccuracies in input parameters such as the ISW PP distance and the propagation
direction of the input ISW wave crest. Because ISW propagation is determined by
eight factors, minor inaccuracies in some input parameters had little effect on the
model’s performance, and the model still produced results that were close to the
satellite data. Note that input parameter errors accumulate over
time when the model runs iteratively. Nevertheless, the model’s predictions remained
valid after two or three tidal cycles, according to the results presented in Fig. 10.

Because the stratification of the ocean fluctuates with precipitation and other
factors, ISW propagation is affected by seasonal variations. The dry season in the
Andaman Sea lasts from January to April, and the rainy season lasts from May to
November. We estimated ISW propagation at four Andaman Sea locations in
the dry and rainy seasons to see how seasonal differences affect the model’s results.

March (August) was selected to represent the dry (rainy) season. The density
information was calculated from the WOA2018 dataset. The depth and density of the
mixed layer differed between the two seasons. The predicted outcomes
of the forecast model are displayed in Fig. 13. The model-predicted ISW positions
for the two seasons were close to each other and also close to the satellite
observations in Fig. 13a, b, and d, whereas the ISW predictions in Fig. 13c exhibited
larger disparities. The findings suggest that ISW propagation shows small
seasonal fluctuations in cases a, b, and d, but a noticeable seasonal difference in
case c.

Fig. 11 Comparison of the model forecast results with different time steps

Figure 13 also depicts the buoyancy frequency at the four locations in the two seasons
(panels e-h). The buoyancy frequency distributions in Fig. 13g and h show larger
seasonal differences than those in Fig. 13e and f. In the dry season, the profiles in
Fig. 13g and h have two peaks, while in the rainy season, there is only one peak. The
most significant difference is found in Fig. 13g, indicating the most substantial
fluctuations in ocean stratification. A larger buoyancy frequency peak indicates
stronger ocean stratification, and ISWs propagate faster with a larger density
difference [27]. This explains why, in Fig. 13c, the ISW propagated faster during the
rainy season.

Fig. 12 Influence of input parameter errors for initial main propagation directions (a, b) and PP
distances (c, d)


The nonlinear propagation velocity of ISWs is described by the KdV equation
[3, 18]:

$$c_p = c_0 + \frac{\alpha}{3} A_0 \qquad (5)$$

where $c_0$ is the linear phase speed, $\alpha$ is the nonlinear coefficient, and $A_0$ is
the ISW amplitude. The ISW propagation velocity depends on the ISW amplitude, which is
normally unknown [21]. Based on previous studies [2, 14], the amplitudes of ISWs
in the Sulu-Celebes Sea vary from 30 to 90 m. To compare the predicted results,
we set the ISW amplitude to 60 m. Figure 14 depicts the comparison results. The
RMSE (FD) between the KdV-predicted ISW positions and satellite observations
is 44.06 km (59.42 km). The developed model estimation has an RMSE (FD) of
14.67 km (28.35 km). The findings reveal that the proposed model’s predicted ISW
positions are closer to the satellite results. The KdV equation produces a larger ISW
propagation velocity error as compared to satellite data.

Fig. 14 Comparison of the forecast results of the proposed model and the KdV equation (upper
panels) and sensitivity of the model results to the ISW amplitude (lower panels)

ISW amplitudes range from tens to hundreds of meters, and this uncertainty causes
the predicted ISW positions to deviate. As an example, assume a linear
propagation speed of 2.5 m/s, a mixed layer depth of 100 m, and a water depth of
3000 m. After one semi-diurnal tidal cycle, a 20 m uncertainty in the ISW amplitude
results in an ISW position error of 32.42 km. Figure 14 shows the forecast results
for three ISW amplitudes (30, 60, and 90 m) using the KdV equation. The results
reveal that ISW locations predicted by the KdV equation are sensitive to the ISW
amplitude, so the predicted positions shift with the amplitude uncertainty.
In contrast, the proposed model has no unknown inputs: all of its input parameters can
be obtained from satellite observations or publicly accessible datasets. The
proposed forecast model is also more resilient when the input parameters contain
errors.


5 Conclusions and Future Works


In this chapter, an ISW forecast model was built using machine
learning techniques to forecast ISW propagation. The training dataset was built using
samples extracted from satellite observations and the publicly accessible datasets
ETOPO1 and WOA 2018. An FCN network with eight input parameters, which include ocean
environmental factors and ISW features, was proposed. Given an initial ISW position,
the proposed forecast model can predict ISW positions after several time steps of
propagation. The model’s estimates are close to the satellite data.

The impact of the model’s time step on the predicted outcomes was investigated.
When propagating ISWs pass over steep isobaths, a smaller time step yields better
results. Measurements of the PP distance and propagation direction of the given ISW
wave crest are prone to errors, and these two inputs are not updated in subsequent
predictions when the forecast model runs iteratively. The impact of input errors on
model estimations was therefore also investigated. The findings reveal that the
proposed model is not sensitive to input parameter errors: an error of ±300 m in the
PP distance or ±10° in the propagation direction did not greatly affect the model
estimation. This result demonstrates that the proposed forecast model can still
provide reliable results when the inputs contain errors. The
impact of seasonal variation on ISW propagation was investigated. The findings
suggest that seasonal fluctuations in ocean stratification lead to differences in the
predicted ISW propagation. Comparison with the KdV equation indicates that the
forecast model produced superior forecast results and was more resilient.

In contrast to numerical models, the forecast model does not need prior knowledge,
a rigid boundary, or initial conditions. Only satellite observations and publicly
available datasets are used. Hence, the model provides an alternative and simple way
to forecast ISW propagation and can be readily adapted to other ocean regions. The
initial position of an ISW wave crest is all that is required to run our forecast
model. The machine learning algorithms used here have considerable potential in
oceanographic research for multi-dimensional data processing and forecasting.


1 Introduction


The ocean acts as a heat sink and is vital to the Earth’s climate system. It regulates
and balances the global climate through the exchange of energy and substances with
the atmosphere and through the water cycle. As a huge heat reservoir, the ocean
absorbs most of the heat from global warming and is sensitive to global climate
change. The global ocean holds over 90% of the Earth’s excess heat in response to
the Earth’s Energy Imbalance (EEI), leading to substantial ocean warming in recent
decades [24, 36]. Subsurface temperature and salinity (thermohaline) are basic and
essential dynamic environmental variables for understanding the global ocean’s role
in the recent global warming caused by greenhouse gas emissions. Moreover, many
significant dynamic processes and phenomena occur beneath the ocean’s surface, and
the ocean’s interior hosts many multiscale and complicated 3D dynamic processes. To
fully understand these processes, it is necessary to accurately estimate the
thermohaline structure in the global ocean’s interior [43].

The ocean has warmed dramatically as a result of heat absorption and sequestration
during recent global warming, and its heat content has risen rapidly in recent
decades [3, 14]. The global upper ocean warmed significantly from 1993 to 2008 [6].
The rate of heat uptake in the intermediate ocean below 300 m has increased much
more in recent years [2]. This indicates that the warming of the ocean above 300 m
is slowing down, while the warming of the ocean below 300 m is speeding up. The
ocean system is accelerating its heat uptake, leading to a significant and
unprecedented increase in heat content and worldwide ocean warming, particularly in
the subsurface and deeper ocean. As a result, the global ocean heat content has hit
record highs in recent years [14, 15]. In addition, ocean salinity, another key
dynamic variable, is also crucial for investigations of ocean variability and
warming. A salinity mechanism has been proposed to explain how the heat from
upper-ocean warming is transferred to the subsurface and deeper ocean [12], which
highlights the importance of the salinity distribution in heat redistribution and
the process of ocean warming. Furthermore, the global hydrological cycle is
modulated by ocean salinity [4]. Thermohaline expansion, which contributes
significantly to sea-level rise, is also linked to ocean temperature and salinity
[9]. Therefore, deriving and predicting the subsurface thermohaline structure is
critical for improving the understanding of dynamic processes and climate
variability in the subsurface and deeper ocean [31].

Due to the sparse and uneven sampling of float observations and the lack of
time-series data in the ocean, there are still large uncertainties in the estimation
of the ocean heat content and the analysis of the ocean warming process [13, 42]. In
the era of ship-based measurement, large areas of the global ocean had few or no
in situ observations, especially in the Southern Ocean. The data obtained by
traditional ship-based methods not only have limited coverage but also cannot
provide uniform spatiotemporal sampling, which hinders multi-scale studies of ocean
processes. Since 2004, the Argo observation network has provided synchronous
observations of the upper 2000 m of the global ocean in space and time [39, 51].
However, the number of Argo floats is still far from sufficient for global ocean
observation: the network cannot provide high-resolution observations of the ocean
interior and cannot meet the requirements of global ocean process and climate change
studies. Because satellite remote sensing can provide large-scale, high-resolution
sea surface observations, it has become an essential technique for ocean
observation. However, satellites cannot directly observe the ocean subsurface
temperature structure [1]. Since many subsurface phenomena have surface
manifestations that can be interpreted with the help of satellite measurements, it
is possible to derive the key dynamic parameters within the ocean (especially the
thermohaline structure) from sea surface satellite observations by means of suitable
mechanism models. Deep ocean remote sensing (DORS) can retrieve ocean interior
dynamic parameters and enables us to characterize ocean interior processes and
features and their implications for climate change [25].

Previous studies have demonstrated that the DORS technique has great potential to
detect and predict the dynamic parameters of the ocean interior indirectly, based on
satellite measurements combined with float observations [41, 43]. DORS methods
mainly include numerical modeling and data assimilation [25], dynamic theoretical
approaches [30, 48, 50], and empirical statistical and machine learning approaches
[23, 41]. The accuracy of numerical and dynamic modeling for subsurface ocean
simulation and estimation at large scales is not guaranteed because of the
complexity and uncertainty of these methods. Reference [47] empirically estimated
mesoscale 3D oceanic thermal structures by employing a two-layer model with a set of
parameters. Reference [35] determined the vertical structure and transport on a
transect across the North Atlantic Current by integrating historical hydrography
with acoustic travel time. Reference [34] estimated the 4D structure of the Southern
Ocean from satellite altimetry by a gravest empirical mode projection. In the era of
big ocean data and artificial intelligence, however, data-driven models,
particularly cutting-edge artificial intelligence or machine learning models,
perform well and can reach high accuracy in DORS applications. So far, empirical
statistical and AI models have been well developed and applied, including the linear
regression model [19, 23], the empirical orthogonal function-based approach [32,
37], the geographically weighted regression model [43], and advanced machine
learning models such as artificial neural networks [1, 45], the self-organizing map
[10], support vector machines [28, 41], random forests (RFs) [43], clustering neural
networks [31], and XGBoost [44]. Although traditional machine learning methods have
made significant contributions to DORS techniques, they are unable to learn the
spatiotemporal characteristics of ocean observation data. In the big Earth data era,
deep learning has been widely used for process understanding in data-driven Earth
system science [38]. Deep learning techniques offer great potential in DORS studies
to help overcome these limitations and improve performance [46]. For example, Long
Short-Term Memory (LSTM) networks capture the time-series features of data well and
thus achieve time-series learning [8], while Convolutional Neural Networks (CNNs)
take the spatial characteristics of data into account and thus achieve spatial
learning [5]. Deep learning techniques have unleashed great potential in data-driven
oceanography and remote sensing research.

This chapter proposes several novel approaches based on ensemble learning and deep
learning to accurately retrieve and depict the subsurface thermohaline structure
from multisource satellite observations combined with Argo in situ data, and
highlights AI applications in deep ocean remote sensing and climate change studies.
We aim to construct AI-based inversion models with strong robustness and
generalization ability that can detect and describe the subsurface thermohaline
structure of the global ocean. From a remote sensing perspective and on a global
scale, the new methods provide powerful AI-based techniques for examining
thermohaline change and variability in the subsurface and deeper ocean, which have
played a significant role in recent global warming.


2 Study Area and Data 


The ocean plays a significant role in modulating the global climate system,
especially during recent global warming and ocean warming [51]. It serves as a
significant heat sink for the Earth’s climate system [12] and also acts as an
important sink for the increasing CO2 caused by anthropogenic activities and
emissions. The study area considered here is the global ocean, which includes the
Pacific Ocean, Atlantic Ocean, Indian Ocean, and Southern Ocean (180° W–180° E and
78.375° S–77.625° N). The satellite-based sea surface measurements adopted in this
study include sea surface height (SSH), sea surface temperature (SST), sea surface
salinity (SSS), and sea surface wind (SSW). The SSH is obtained from AVISO satellite
altimetry. The SST is acquired from the Optimum Interpolation Sea-Surface
Temperature (OISST) data. The SSS is obtained from the Soil Moisture and Ocean
Salinity (SMOS) mission. The SSW is acquired from the Cross-Calibrated
Multi-Platform (CCMP) product. The longitude (LON) and latitude (LAT) georeference
information is also employed as supplementary input parameters. All sea surface
variables above have the same 0.25° × 0.25° spatial resolution. The subsurface
temperature (ST) and salinity (SS) data are from Argo gridded products with 1° × 1°
spatial resolution. This study adopted Argo gridded data for the upper 1,000 m of
the ocean, at 16 depth levels, as label data. We first applied nearest neighbor
interpolation to bring the satellite-based sea surface variables to the same
1° × 1° spatial resolution.
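
The regridding step can be illustrated with a minimal sketch. It assumes regular
latitude and longitude axes and simply picks, for each 1° target point, the value of
the closest 0.25° source point. The function names, grid coordinates, and toy field
below are hypothetical and are not taken from the study.

```python
def nearest_index(coords, target):
    """Return the index of the coordinate closest to `target`."""
    return min(range(len(coords)), key=lambda k: abs(coords[k] - target))


def regrid_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Sample field[lat][lon] at the source point nearest each target point."""
    lat_idx = [nearest_index(src_lat, y) for y in dst_lat]
    lon_idx = [nearest_index(src_lon, x) for x in dst_lon]
    return [[field[i][j] for j in lon_idx] for i in lat_idx]


# Hypothetical 0.25-degree grid over a 2 x 2 degree box.
src_lat = [0.25 * k for k in range(8)]
src_lon = [0.25 * k for k in range(8)]
# Toy field whose value encodes its own (row, column) position.
field = [[10 * i + j for j in range(8)] for i in range(8)]

# Hypothetical 1-degree target points.
dst_lat = [0.5, 1.5]
dst_lon = [0.5, 1.5]

coarse = regrid_nearest(field, src_lat, src_lon, dst_lat, dst_lon)
print(coarse)
```

Nearest neighbor sampling keeps the original data values and does no averaging, so
small-scale features in the 0.25° fields are subsampled rather than smoothed.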

The climatology (baseline: 2005-2016) was subtracted from all of the aforementioned
satellite-based sea surface variables and Argo gridded data to obtain their anomaly
fields, which removes the climatological seasonal signal [41]. In this study, we
primarily focus on the nonseasonal anomaly signals, which are more difficult to
detect but more significant for climate change. We applied maximum-minimum
normalization to scale the training dataset to the range [0, 1]. The testing dataset
was then subjected to the corresponding normalization, which effectively prevents
data leakage during modeling.
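
A minimal sketch of these two preprocessing steps for a single grid point is given
below. It assumes that "corresponding normalization" means reusing the minimum and
maximum found on the training data when scaling the testing data; this reading, the
function names, and the toy values are assumptions for illustration.

```python
def monthly_climatology(values, months):
    """Mean value for each calendar month over the baseline record."""
    sums, counts = {}, {}
    for v, m in zip(values, months):
        sums[m] = sums.get(m, 0.0) + v
        counts[m] = counts.get(m, 0) + 1
    return {m: sums[m] / counts[m] for m in sums}


def to_anomaly(values, months, climatology):
    """Subtract the climatological mean of the matching calendar month."""
    return [v - climatology[m] for v, m in zip(values, months)]


def fit_minmax(train_values):
    """Scaling parameters are derived from the training data only."""
    return min(train_values), max(train_values)


def apply_minmax(values, lo, hi):
    """Map values so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


# Toy record: two "years" of three months each (hypothetical values).
months = [1, 2, 3, 1, 2, 3]
sst = [10.0, 12.0, 14.0, 12.0, 12.0, 10.0]

clim = monthly_climatology(sst, months)
anomaly = to_anomaly(sst, months, clim)

train, test = anomaly[:4], anomaly[4:]
lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)   # reuses the training lo and hi
```

Because the scaling parameters come from the training data alone, scaled test
values may fall slightly outside [0, 1], which is expected under this scheme.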


3 Retrieving Subsurface Thermohaline Based on Ensemble 
Learning 


Here, the procedure for subsurface thermohaline retrieval based on machine learning
contains three technical steps. Firstly, the training dataset for the model was
constructed. We selected the satellite-based sea surface parameters (SSH, SST, SSS,
SSW) as input variables for the AI-based models, and the subsurface temperature
anomaly (STA) and salinity anomaly (SSA) from Argo gridded data were adopted as data
labels for training and testing. All the input surface and subsurface datasets were
uniformly normalized and randomly separated into a training dataset (60%) and a
testing dataset (40%), which were used to train and test the AI-based models,
respectively. Secondly, the model was trained using the training dataset. The
model’s hyper-parameters were tuned with a Bayesian optimization approach, and a
suitable machine learning model was then set up using the optimal hyper-parameters.
Finally, the prediction was performed with the trained model. We predicted the STA
and SSA with the optimized model and then evaluated the model performance and
accuracy by the coefficient of determination (R²) and root-mean-square error (RMSE).
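
The random 60/40 split and the two evaluation metrics can be sketched as follows.
The sketch assumes that R² is the usual coefficient of determination, 1 minus the
ratio of residual to total sum of squares; the text does not state the exact formula
used, and the function names and sample values are illustrative only.

```python
import math
import random


def random_split(n_samples, train_fraction=0.6, seed=0):
    """Return shuffled index lists for the training and testing subsets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n_samples))
    return idx[:cut], idx[cut:]


def rmse(y_true, y_pred):
    """Root-mean-square error between labels and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


train_idx, test_idx = random_split(10)

# Hypothetical labels and predictions for four samples.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r2_score(y_true, y_pred), rmse(y_true, y_pred))
```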


3.1 EXtreme Gradient Boosting (XGBoost) 


Gradient Boosting Decision Tree (GBDT) is an iterative boosting algorithm composed
of multiple decision trees [16]. EXtreme Gradient Boosting (XGBoost) is an upgraded
GBDT ensemble learning algorithm [11], as well as an optimized distributed gradient
boosting library. XGBoost implements an ensemble machine learning algorithm based on
decision trees within a gradient boosting framework, and it also provides parallel
tree boosting that solves many data science problems in an efficient, flexible, and
accurate way. Parameter tuning is essential to achieve optimal model performance.
XGBoost contains several hyper-parameters related to the complexity and
regularization of the model [49], and they must be optimized in order to refine the
model and improve its performance. Here, we used the well-performing Bayesian
optimization approach to tune the XGBoost hyper-parameters.

Fig. 1 Spatial distribution of the (a) Argo STA and the (b) XGBoost-estimated STA in
December 2015 at 600 m depth
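
To make the boosting idea concrete, the sketch below fits a sequence of one-split
regression trees (stumps) to the residuals of the current prediction, which is plain
gradient boosting for squared error. It is not XGBoost itself: it omits the
regularization terms, second-order gradients, and parallel tree construction, and
the function names, learning rate, and toy data are assumptions.

```python
def fit_stump(x, residual):
    """Least-squares single split of a 1D feature; needs >= 2 distinct x values."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        pred = [pi + learning_rate * stump_predict(stump, xi)
                for pi, xi in zip(pred, x)]
    return base, learning_rate, stumps


def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy nonlinear target (hypothetical data).
x = [float(k) for k in range(10)]
y = [xi ** 2 for xi in x]

model = fit_boosting(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
```

The number of rounds and the learning rate play the role of the complexity-related
hyper-parameters mentioned above; in practice they would be tuned rather than fixed.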

Figures 1 and 2 show the spatial distribution of the subsurface temperature and
salinity anomalies (STA and SSA) of the global ocean from the XGBoost-based result
and Argo gridded data in December 2015 at 600 m depth. Both the XGBoost-estimated
STA and SSA are highly consistent with the Argo gridded STA and SSA at 600 m depth.
The R² of STA/SSA between the Argo gridded data and the XGBoost-estimated result is
0.989/0.981, and the RMSE is 0.026 °C/0.004 PSU.

Fig. 2 Spatial distribution of the (a) Argo SSA and the (b) XGBoost-estimated SSA in
December 2015 at 600 m depth


3.2 Random Forests (RFs) 


Random Forests (RFs) are a popular and widely used ensemble learning method for
data classification and regression. Reference [7] proposed the general strategy of
RFs, which fit numerous decision trees on various data subsets obtained by randomly
resampling the training data. RFs average the predictions of the individual trees,
which improves the prediction accuracy and corrects for the tendency of a single
decision tree to overfit. RFs have been effectively applied in various remote
sensing fields [21, 53] and generally perform very well. Several advantages make RFs
well-suited to remote sensing studies [20, 52].

Fig. 3 Spatial distribution of the (a) Argo STA and the (b) RFs-estimated STA in
June 2015 at 600 m depth

The basic strategy of RFs is to grow a number of decision trees on random subsets
of the training data [40], determine the decision rules, and choose the best split
at each node [29]. This strategy performs well compared with many other classifiers
and is robust against overfitting [7]. RFs require only two input parameters for
training, the number of trees in the forest (n_tree) and the number of
variables/features in the random subset at each node (m_try), and the results are
generally not very sensitive to the values of either parameter [29].
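
The roles of n_tree and m_try can be illustrated with a deliberately small sketch in
which every tree is a single-split regression stump. Each stump is grown on a
bootstrap resample and may only split on a random subset of m_try features; since a
stump has only one node, drawing the subset once per tree is equivalent to drawing
it per node here. This is a simplified illustration under these assumptions, not the
implementation used in the study, and all names and data are hypothetical.

```python
import random


def fit_stump(rows, targets, features):
    """Best least-squares (feature, threshold) split among the candidate features."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for threshold in values[:-1]:
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or err < best[0]:
                best = (err, f, threshold, lm, rm)
    if best is None:  # every candidate feature is constant in this resample
        mean = sum(targets) / len(targets)
        return (features[0], float("inf"), mean, mean)
    return best[1:]


def fit_forest(rows, targets, n_tree=25, m_try=1, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_tree):
        sample = [rng.randrange(n) for _ in range(n)]       # bootstrap resample
        features = rng.sample(range(n_features), m_try)     # random feature subset
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], features))
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    votes = [lm if row[f] <= threshold else rm for f, threshold, lm, rm in forest]
    return sum(votes) / len(votes)


# Toy data: feature 0 is informative, features 1 and 2 are uninformative.
rows = [[float(x), float((7 * x) % 5), float((3 * x) % 4)] for x in range(20)]
targets = [0.0 if x < 10 else 1.0 for x in range(20)]

forest = fit_forest(rows, targets, n_tree=25, m_try=2, seed=0)
low, high = predict(forest, rows[1]), predict(forest, rows[18])
```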

Figures 3 and 4 show the spatial distribution of the subsurface thermohaline
anomalies of the global ocean from the RFs-based result and Argo gridded data in
June 2015 at 600 m depth. The spatial distribution and pattern of the RFs-estimated
results and the Argo gridded data are quite similar. The R² of STA/SSA between the
Argo data and the RFs-estimated result is 0.971/0.972, and the RMSE is
0.042 °C/0.005 PSU.

Fig. 4 Spatial distribution of the (a) Argo SSA and the (b) RFs-estimated SSA in
June 2015 at 600 m depth


4 Predicting Subsurface Thermohaline Based on Deep
Learning


The prediction process for subsurface thermohaline based on deep learning includes
three steps. Firstly, the training dataset was prepared by combining satellite-based
sea surface parameters (SSH, SST, SSS, SSW) with Argo subsurface data as training
labels. Secondly, we carried out hyperparameter tuning based on a grid-search
strategy to obtain an optimal deep learning model by training. Here, we set up the
time-series deep learning models by adopting the time-series data as the training
dataset and the rest as the testing dataset, so as to realize time-series subsurface
thermohaline prediction. Finally, the RMSE and R² were adopted to evaluate the model
performance and accuracy.


4.1 Bi-Long Short-Term Memory (Bi-LSTM) 


The LSTM is a type of recurrent neural network [22] that is well-suited to
time-series modeling and has been widely applied in natural language processing and
speech recognition. The primary principle behind LSTM is to leverage the target
variable’s historical information. Unlike in traditional feedforward neural
networks, the training errors in an LSTM propagate over a time sequence, capturing
the time-dependent relationships in the training data’s historical information [18].
Bi-Long Short-Term Memory (Bi-LSTM) is an upgraded LSTM algorithm. The Bi-LSTM
consists of two unidirectional LSTMs that process the input sequence forward and
backward simultaneously, and it thereby captures information that a unidirectional
LSTM ignores.

To ensure that the Bi-LSTM model achieves good performance and high accuracy, it is
necessary to select and tune proper hyperparameters for the model. Here, we randomly
picked 20% of the training dataset for Bi-LSTM hyperparameter tuning. The Bayesian
optimization approach was used in this study to obtain the best number of layers and
neurons for the Bi-LSTM network. Through model testing, we finally selected a neural
network with three layers and neuron counts of 32, 64, and 64 for the respective
layers. Moreover, batch normalization was applied after the hidden layers of each
network. According to previous practice, optimal performance can be attained with
mini-batch sizes ranging from 2 to 32 [33]. Thus, the batch size was set to 32 for
the model. In addition, the optimal number of epochs was set to 257 for the STA
network and 81 for the SSA network. We also adopted the RMSE, R², and Spearman’s
rank correlation coefficient (ρ) to determine the optimal Bi-LSTM timestep. The
results demonstrate that the Bi-LSTM model performs best when the network timestep
is set to 10. Thus, the timestep here was set to 10.

Table 1 The different datasets fed to the Bi-LSTM model

Target month   Training dataset   Predicting dataset   Testing dataset
2015.12        2010.12-2015.11    2015.03-2015.12      2015.12


We employed the data from December 2010 to November 2015 as the training dataset
and the data in December 2015 as the testing dataset, so the testing dataset is the
target-month dataset used for performance evaluation (Table 1). In general, the
Bi-LSTM operates on a whole temporal sequence in both training and prediction, but
for accuracy validation we focused only on the target month.




Fig. 5 Spatial distribution of the (a) Argo STA and the (b) LSTM-predicted STA in
December 2015 at 200 m depth

Fig. 6 Spatial distribution of the (a) Argo SSA and the (b) LSTM-predicted SSA in
December 2015 at 200 m depth


When constructing the input dataset for the Bi-LSTM, we restructured the data grid
by grid in time sequence according to the rule

$X_{i}^{j=1}, X_{i}^{j=2}, \ldots, X_{i}^{j=60}, X_{i+1}^{j=1}, \ldots, X_{i+1}^{j=60}, \ldots$

where i represents the grid point and j represents the month.
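
The grid-by-grid time ordering combined with a timestep of 10 can be sketched as a
sliding window over the monthly sequence of one grid point. The sketch assumes that
each sample holds 10 consecutive months and that the label belongs to the last month
of the window, which is consistent with Table 1 (months 2015.03 to 2015.12 are used
to predict 2015.12) but is not spelled out in the text. The helper names are
hypothetical.

```python
def month_labels(start_year, start_month, count):
    """Consecutive month labels such as '2010.12', '2011.01', ..."""
    labels = []
    for k in range(count):
        total = (start_month - 1) + k
        labels.append(f"{start_year + total // 12}.{total % 12 + 1:02d}")
    return labels


def make_windows(series, timestep=10):
    """Overlapping windows of `timestep` consecutive entries for one grid point."""
    return [series[end - timestep:end]
            for end in range(timestep, len(series) + 1)]


# Training period of Table 1: 60 months from 2010.12 to 2015.11.
train_months = month_labels(2010, 12, 60)
windows = make_windows(train_months, timestep=10)

# Input window for the target month 2015.12 (the "predicting dataset").
predict_window = month_labels(2015, 3, 10)

print(len(windows), windows[0][0], windows[-1][-1], predict_window[-1])
```

With 60 training months and a timestep of 10, each grid point contributes 51
overlapping training samples under this windowing assumption.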

Figures 5 and 6 show the spatial distribution of the subsurface temperature and
salinity anomalies of the global ocean from the LSTM-predicted result and Argo
gridded data in December 2015 at 200 m depth. The LSTM-predicted result can retrieve
and capture most anomaly signals in the subsurface ocean. The R² of STA/SSA between
the Argo gridded data and the LSTM-predicted result is 0.728/0.476, and the RMSE is
0.378 °C/0.055 PSU.

Figure 7 shows the meridional profile (at longitude 190°) of the Argo gridded and
LSTM-predicted STA for vertical comparison and validation. The two vertical profiles
are highly consistent in their vertical distribution pattern: over 99.75% of the
profile points were within ±1 °C prediction error, and over 99.44% were within
±0.5 °C. Figure 8 shows the same meridional profile for the Argo gridded and
LSTM-predicted SSA. The two vertical profiles match well in their vertical
distribution pattern: over 99.55% of the profile points were within ±0.2 PSU
prediction error, and over 99.36% were within ±0.1 PSU. The results demonstrate that
the model predicts both STA and SSA with high accuracy.

Fig. 7 The meridional vertical profile of the STA in December 2015 at the longitude
of 190° for (a) Argo gridded data, and (b) LSTM-predicted result

Fig. 8 The meridional vertical profile of the SSA in December 2015 at the longitude
of 190° for (a) Argo gridded data, and (b) LSTM-predicted result
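
The "share of profile points within a given error" statistic used above can be
computed as in the following sketch. The function name and the sample values are
hypothetical; only the definition (absolute error no larger than a tolerance) is
assumed.

```python
def fraction_within(y_true, y_pred, tolerance):
    """Fraction of points whose absolute prediction error is <= tolerance."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)


# Hypothetical anomaly values (degrees Celsius) at four profile points.
argo = [0.0, 0.5, -0.3, 1.2]
predicted = [0.1, 0.4, 0.4, 0.1]

share_1c = fraction_within(argo, predicted, 1.0)
share_05c = fraction_within(argo, predicted, 0.5)
print(share_1c, share_05c)
```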


4.2 Convolutional Neural Network (CNN) 


Convolutional Neural Network (CNN) is a well-known deep learning algorithm.
Reference [17] proposed a neural network structure, including convolution and
pooling layers, which can be regarded as the first implementation of the CNN model.
On this basis, Reference [27] proposed the LeNet-5 network, which used the error
backpropagation algorithm in the network structure and is considered a prototype of
the CNN. In 2012, a deep network structure with the dropout method was applied in
the ImageNet image recognition contest [26] and significantly reduced the error
rate, which opened a new era in the image recognition field. So far, the CNN
technique has been widely used in a variety of applications, including climate
change and marine environmental remote sensing applications [5]. Here, the CNN
algorithm combined with satellite observations was employed to predict ocean
subsurface parameters.

118 H. Su et al.

Fig. 9 Spatial distribution of the (a) Argo ST and the (b) CNN-predicted ST in
December 2015 at 200 m depth

We used the CNN approach to retrieve ocean subsurface temperature (ST) and salinity
(SS) directly from satellite remote sensing data. Figures 9 and 10 show the spatial
distribution of the subsurface thermohaline of the global ocean from the
CNN-predicted and Argo gridded data in December 2015 at 200 m depth. The R² of ST/SS
between the Argo gridded data and the CNN-predicted result is 0.972/0.822, and the
RMSE is 0.924 °C/0.293 PSU.

Fig. 10 Spatial distribution of the (a) Argo SS and the (b) CNN-predicted SS in
December 2015 at 200 m depth


5 Conclusions 


This chapter proposes several AI-based techniques (ensemble learning and deep
learning) for retrieving and predicting subsurface thermohaline in the global ocean.
The proposed models are shown to estimate the subsurface temperature and salinity
structures in the global ocean accurately from multisource satellite remote sensing
observations (SSH, SST, SSS, and SSW) combined with Argo float data. The performance
and accuracy of the models are evaluated against Argo in situ data. The results
demonstrate that the AI-based models have strong robustness and generalization
ability and can be applied to the prediction and reconstruction of subsurface
dynamic environmental parameters.

We employ the XGBoost and RFs ensemble learning algorithms to derive the subsurface
temperature and salinity of the global ocean. The R²/RMSE of the XGBoost-retrieved
STA and SSA are 0.989/0.026 °C and 0.981/0.004 PSU, and the R²/RMSE of the
RFs-retrieved STA and SSA are 0.971/0.042 °C and 0.972/0.005 PSU. Moreover, the
Bi-LSTM and CNN deep learning algorithms are adopted for time-series prediction of
subsurface thermohaline. The R²/RMSE of the Bi-LSTM-predicted STA and SSA are
0.728/0.378 °C and 0.476/0.055 PSU, and the R²/RMSE of the CNN-predicted ST and SS
are 0.972/0.924 °C and 0.822/0.293 PSU (the CNN predicts ST and SS directly).
Overall, ensemble learning algorithms, which are suited to small-data modeling, can
be used to retrieve the mono-temporal subsurface thermohaline structure, while deep
learning algorithms, which are suited to big-data modeling, can be adopted to
predict the time-series subsurface thermohaline structure.

In the future, we can employ longer time series of remote sensing data for modeling
and use more advanced deep learning algorithms to improve the model applicability
and robustness. We should further promote the application of AI and deep learning
techniques in deep ocean remote sensing and data reconstruction for revisiting
global ocean warming and climate change. AI technology shows great potential for
detecting and predicting subsurface environmental parameters based on multisource
satellite measurements, and it can provide a useful technique for promoting studies
of deep ocean remote sensing as well as of ocean warming and climate change during
recent decades.


1 Introduction


In recent decades, the imbalance in the top-of-atmosphere radiation, termed the
Earth’s energy imbalance (EEI) [49], has continuously driven changes in the global
climate system, leading to continued global warming. The EEI is defined as the net
heat gain of the Earth’s climate system, calculated as the difference between the
energy entering and the energy leaving the Earth [50], and it must be accurately
quantified in order to investigate and comprehend the past, present, and future
state of climate change [38]. Because its magnitude is small compared with solar
radiation, the EEI is difficult to quantify accurately [38]. Yet more than 93% of
the EEI of the Earth system is sequestered in the ocean as ocean heat content (OHC)
changes [5, 7]. This is due to the large heat capacity and gigantic volume of
seawater: the ocean accounts for ~71% of the world’s surface area and ~97% of its
total water volume. As a result, the OHC varies more slowly and can better capture
low-frequency climate variability. These properties make OHC a more suitable
variable than sea surface temperature (SST) for detecting and tracking EEI changes
[26, 50].

OHC is driven by both human activity and natural variability. Anthropogenic forcing
is reflected in the OHC and has led to an accelerating OHC warming rate [35], so the
OHC serves as an essential indicator of ocean variability. In turn, OHC also feeds
back on climate change [10, 25]. In terms of natural variability on the
multi-decadal timescale, the global OHC is highly relevant to the Earth’s heat
balance [3]. OHC is also closely associated with the El Niño Southern Oscillation
(ENSO), which dominates its interannual variability. In recent years, a hemispheric
asymmetry of OHC changes has emerged, which can likely be explained by internal
dynamics rather than by differences in surface forcing [39]. The world ocean in 2021
was the hottest ever recorded despite the La Niña conditions [17]. In summary, the
accurate quantification of the OHC is crucial to understanding the EEI [10, 50].

Remote sensing can provide wide and near-real-time coverage as well as a vast
collection of spatial and temporal information. However, in most circumstances
remote sensing sees only the water surface and cannot penetrate into the ocean’s
interior. Recently, a series of subsurface and deeper ocean remote sensing (DORS)
methods were developed to unlock the enormous potential of remote sensing data in
sensing the ocean interior [31]. In particular, Artificial Intelligence (AI) methods
have provided cutting-edge tools and infrastructures [41].

Different remote sensing data have been applied to derive subsurface thermal
information via various methods [18, 31, 32, 52]. These early initiatives
demonstrated the concept of DORS as a way to tackle the issue of data sparsity.
Recent research has shown that surface remote sensing data may be effectively used
to retrieve the subsurface temperature anomaly (STA) via machine-learning or AI
approaches. For instance, References [27] and [23] showed that subsurface structures
were dominated by the first baroclinic mode and thus can be estimated from SSH. By
merging remote sensing data with a Self-Organizing Map (SOM) approach, Reference
[51] further confirmed the theory’s credibility in the Northern Atlantic Ocean.
Reference [36] used a clustered shallow neural network (NN) to obtain subsurface
temperature, demonstrating the promise of NNs as a category of generic techniques
with powerful regression capabilities. Other relevant contributions include [2, 8,
20, 21, 23], to mention a few.

Among the several methodologies, neural networks (NNs), the foundation of
contemporary deep learning breakthroughs, have demonstrated their capability in the
regression problems of ocean subsurface estimation [2, 27-29, 48]. Yet the
application of NN models to temporally extrapolate remote sensing data has been
limited. This is partly because of the difficulty of time series estimation and
partly because, in the deep ocean, STA is only indirectly influenced by the surface
signals. The exact physical controls are fundamentally nonlinear. In this regard,
OHC is more tightly coupled with the surface forcing [40], which may lead to a more
physically consistent DORS application. So far, only a few studies have used surface
data to retrieve OHC [27, 54]. In the ground-breaking work of [27], an NN was
trained to derive discrete site-wise Indian Ocean OHC. Given that different ocean
basins have different OHC dynamics and thus different linkages to the surface, the
first goal of this study is to determine whether this approach can be extended to
the entire global ocean. We answer this question by developing an NN model, driven
by big data, that estimates OHC accurately for the global ocean and can temporally
extend OHC data back to the pre-Argo era, from 1993 to 2004.

The method will also be used to generate an OHC product, hindcasting the OHC before
the Argo era. Since the early 2000s, when Argo floats began to be continuously
deployed, the ability to accurately quantify OHC has increased to an unprecedented
degree [44]. To date, a network of over 4000 Argo floats has been detecting robust
climate signals in the global ocean’s large-scale dynamic features. However, prior
to the Argo era, there was no reliable full-coverage ocean interior data. As a
result, various climate issues remain under discussion and debate. Consider the
trend of heat redistribution during the “hiatus” period between 1998 and the late
2010s, when global warming appeared to be slowing [53]. Various climate signals have
been detected, each backed by different data products [45], which places high
demands on the quality of ocean interior data if they are to give comprehensive and
effective support for climate research throughout this time period [53]. For
example, there are two broad opinions on the driving processes: one is a mechanism
controlled by the Atlantic meridional overturning circulation [11], and the other
comprises mechanisms originating in the Indo-Pacific [33].

Ocean Heat Content Retrieval from Remote Sensing Data Based on Machine Learning 127

In this chapter, we describe the NN model that yields an NN-based global OHC
product named Ocean Projection and Extension neural Network (OPEN) [47]. The
technical details are described with a focus on the NN approach. This chapter
is structured as follows. After presenting the data in Sect. 2, the NN method is
detailed in Sect. 3, together with the design of experiments to optimize the network.
In Sect. 4, we first test the sensitivity of the network to its parameters and
structure. The OHC is then reconstructed for 1993 to 2020, which extends it to the
pre-Argo era. In addition, OPEN and other well-known near-global OHC products are
evaluated in terms of linear trends and variability modes. Finally, in Sect. 5, we
summarize the results and provide prospects for future studies.


2 Data 


A summary of all data sets utilized in this chapter is shown in Table 1, including 
an Argo-based three-dimensional temperature product to derive OHC, multi-source 
satellite remote sensing data, and OHC products from different sources. 

The sea surface height (SSH) is from the Absolute Dynamic Topography products
of Archiving, Validation, and Interpretation of Satellite Oceanographic data (AVISO).
The SST is from the Optimum Interpolation Sea Surface Temperature (OISST). The sea
surface wind (SSW) is from the Cross Calibrated Multi-platform (CCMP). These
three products have a common spatial resolution of one quarter of a degree. The sea
surface salinity (SSS) is adopted from the Soil Moisture Ocean Salinity (SMOS)
product, which has a spatial resolution of one degree. We linearly interpolated
all the products except SSS to a one-degree grid.

The OHC ‘ground truth’ was derived from the Roemmich and Gilson [44] gridded
Argo product, which consists of 27 standard levels over 0-2000 m. The variables
include pressure, temperature, and salinity. Dynamic heights computed from the
T/S profiles are also provided. The product has a monthly time interval from 2005
to the present and a spatial resolution of 1° x 1°. By definition, the OHC can be
calculated as the depth integral of temperature $T$ from the surface to a particular
depth $Z$:

$$\mathrm{OHC} = \rho C_p \int_0^{Z} T \, dz \qquad (1)$$

In the integration, $\rho$ is the seawater density and $C_p$ is the heat capacity.
Constant values of 1025 kg m$^{-3}$ and 3850 J kg$^{-1}$ K$^{-1}$ were applied.
OHC300, OHC700, OHC1500, and OHC2000 refer to the OHC integrated over the top 300,
700, 1500, and 2000 m, respectively. The reference OHC, defined as the climatological
mean of 2005-2015, was then computed and removed to provide OHC anomalies. Hereafter,
we report OHC anomalies unless otherwise indicated.
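
As a rough illustration of Eq. (1), the following Python sketch integrates a single temperature profile with the trapezoidal rule and removes a reference-period mean. The function names, the trapezoidal scheme, and the idea of interpolating onto the integration limit are illustrative assumptions, not the processing actually applied to the Argo product. The result is per unit area (J m$^{-2}$), so a grid-cell area would still have to be applied to obtain joules.

```python
# Sketch only: trapezoidal depth integration of Eq. (1) for one profile.
RHO = 1025.0  # seawater density, kg m^-3 (constant quoted in the text)
CP = 3850.0   # heat capacity, J kg^-1 K^-1 (constant quoted in the text)


def ohc_column(depths_m, temps, max_depth_m):
    """Integrate RHO * CP * T from the first level down to max_depth_m."""
    total = 0.0
    for k in range(len(depths_m) - 1):
        z0, z1 = depths_m[k], depths_m[k + 1]
        if z0 >= max_depth_m:
            break
        t0, t1 = temps[k], temps[k + 1]
        if z1 > max_depth_m:
            # Linearly interpolate temperature onto the integration limit.
            t1 = t0 + (t1 - t0) * (max_depth_m - z0) / (z1 - z0)
            z1 = max_depth_m
        total += 0.5 * (t0 + t1) * (z1 - z0)
    return RHO * CP * total


def anomalies(series, ref_slice):
    """Subtract the mean over a reference period from every value."""
    ref = series[ref_slice]
    mean = sum(ref) / len(ref)
    return [v - mean for v in series]
```

Calling `ohc_column` with `max_depth_m` set to 300, 700, 1500, or 2000 would correspond to OHC300 through OHC2000, and `anomalies` mirrors the removal of the 2005-2015 climatological mean.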

Other near-global OHC products will be compared with the OPEN product. These
data sets are: National Centers for Environmental Information (NCEI) data by [35],
Institute of Atmospheric Physics (IAP) data by [13], EN4 from the Met Office of
the United Kingdom by [22], empirical DORS-based ARMOR3D data by [23], and the
numerical reanalysis GLORYS2V4. Among these data products, NCEI, IAP, and
EN4 are all optimally interpolated (mapped) one-degree products built from a common
collection of discrete station and profiling data. The sources of the in-situ data
include Argo profilers, conductivity-temperature-depth (CTD) casts, and expendable
bathythermographs (XBTs). ARMOR3D and GLORYS2V4 both have a 0.25-degree
resolution. Because we only use basin OHC summations, the difference in resolution
is not an issue.


3 Method 


3.1 Neural Network 


The NN with a total of o layers (h as hidden layers) applied in this chapter can be
generally formulated as:

$$\text{Neurons in input layer:}\quad h_1 = f_1(x; \theta_1) = \sigma_1\Big(b_1 + \sum_{i}^{\mathrm{features}} w_{1,i}\, x_i\Big) \qquad (2)$$

$$\text{Neurons in hidden layer(s):}\quad h_2 = f_2(h_1; \theta_2) = \sigma_2\Big(b_2 + \sum_{j} w_{2,j}\, h_{1,j}\Big) \qquad (3)$$

$$\text{Neuron in output layer:}\quad \hat{y} = f_o(h_{o-1}; \theta_o) \qquad (4)$$

For a regression problem, one may express an NN mathematically as an approximation
function $\hat{y} = f(x; \theta)$ from the inputs $x$ to the OHC $y$ with parameters
$\theta$. Here $\theta$ includes the weights $w$, biases $b$, and activation
functions $\sigma$ for each neuron in the hidden layer.

Generally, a network requires one input layer, one or more hidden layers, and one
output layer. The layers are connected to each other as a stack. The input layer
collects the input features and hence has as many neurons as there are features.
Each neuron in each layer computes a weighted sum of the outputs of its previous
layer and then applies its activation function to produce a nonlinear output, which
is passed on to the next layer. The number of neurons is often described as the
width and the number of hidden layers as the depth, i.e., a deep NN has more hidden
layers.
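
The layer-by-layer computation described above can be sketched in a few lines of Python. This is a minimal forward pass only. The helper names, the way parameters are stored, and any weights passed in are hypothetical, and training (including the Bayesian regularization used in this chapter) is not shown.

```python
import math


def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(b + sum_j w_j * input_j) per neuron."""
    return [
        activation(b + sum(w * x for w, x in zip(w_row, inputs)))
        for w_row, b in zip(weights, biases)
    ]


def relu(v):
    """Rectified linear unit."""
    return max(0.0, v)


def identity(v):
    """Linear output, typical for a regression output neuron."""
    return v


def forward(x, params):
    """Propagate features x through a list of (weights, biases, activation)."""
    h = x
    for weights, biases, activation in params:
        h = layer(h, weights, biases, activation)
    return h[0]  # single output neuron, e.g., the OHC estimate
```

A tangent sigmoid hidden layer would simply pass `math.tanh` as the activation. The width of a layer is the number of rows in its weight matrix, and the depth is the length of `params` minus the output layer.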

Training such an NN is essentially an optimization problem: find the parameters
$\theta$ that minimize the cost function $J$, which is the mean squared error. This
can be formulated as:

$$\underset{\theta}{\arg\min}\; J(\theta) = \frac{1}{N} \sum_{(x,y)\in Tr} (\hat{y} - y)^2 \qquad (5)$$

In the equation, $Tr$ refers to the training set, which contains $N$ samples.
We applied Bayesian regularization to the NN, following our previous study [36].
Our experience suggests that by smoothing the cost function J, the Bayesian
regularization approach can efficiently avoid overfitting [19]. This trait is
advantageous for temporal projection because a smooth model is more likely to work
effectively when new data are provided. An ensemble technique was also applied. Six
training periods were defined, starting in 2005, 2006, 2007, 2008, 2009, and 2010
and ending in 2013, 2014, 2015, 2016, 2017, and 2018, respectively. All data outside
the training period were used as the testing set. The uncertainty is evaluated as
three times the standard deviation across the six ensemble members. A distinct NN
was trained for each depth. Once the remote sensing data are provided, the OHC field
can be derived. The ensemble average is reported as our hindcast of OHC in the
following sections.
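
The ensemble bookkeeping can be illustrated with a short Python sketch. It lists the six training windows given in the text and summarizes a set of member estimates by their mean and three standard deviations. The function name is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator was used.

```python
import statistics

# Six overlapping training windows from the text: 2005-2013, ..., 2010-2018.
TRAIN_PERIODS = [(start, start + 8) for start in range(2005, 2011)]


def ensemble_summary(member_estimates):
    """Return (ensemble mean, 3 * standard deviation) across members."""
    mean = statistics.fmean(member_estimates)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    spread = 3.0 * statistics.stdev(member_estimates)
    return mean, spread
```

In practice, `member_estimates` would hold the OHC values predicted at one grid point and time by the six networks, one per training window.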


3.2 Design of Experiments 


The NN relies heavily on a proper combination of sea surface variables. In the AI
field, these variables are referred to as input features. One may certainly train
an NN on unrelated input-output data, yet this often results in an overfitted NN.
An NN model can be expected to extrapolate successfully to unseen data only if
there is a clear input-output relationship that it can learn. Furthermore, in
the practice of optimization, choosing the best feature combination might be
paradoxical at times [46]. As a result, features are frequently chosen in a somewhat
ad hoc manner, in a process known as feature engineering. The availability of
historical data also influences feature selection. In the current study, to find the
best combination of features, we designed 16 experiments as shown in Table 2. For
these experiments, we chose the OHC300 in January 2011 as the target to be hindcasted
with the tuned NN and a data subset from the 12 months of 2010 as the training data.
Note that the conclusions here are insensitive to the data subsetting. Each
experiment has a Case A and a Case R. Case R uses remote sensing SSH, together with
remote sensing SST and SSS, while Case A takes these variables from the uppermost
level of the Argo data. Both experiment series share the same SSW data set. This
tests whether Argo ‘surface’ data can be transferred to remote sensing OHC
estimation. In addition, we designed several experiments to evaluate the role of
temporal and spatial information, involving day of year (DOY), longitude (LON), and
latitude (LAT). Finally, the normalized root-mean-square error (NRMSE) and the
determination coefficient ($R^2$) were used to measure the network performance.
NRMSE is the ratio of the root-mean-square error to the corresponding standard
deviation.
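
The two skill metrics can be written out as follows. This is a sketch under stated assumptions: the chapter does not specify which standard deviation normalizes the RMSE or which definition of the determination coefficient is used, so the code assumes the population standard deviation of the reference values and the 1 - SS_res / SS_tot form. The function names are illustrative.

```python
import math


def rmse(truth, estimate):
    """Root-mean-square error between reference and estimated values."""
    n = len(truth)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / n)


def nrmse(truth, estimate):
    """RMSE divided by the standard deviation of the reference values."""
    mean = sum(truth) / len(truth)
    std = math.sqrt(sum((t - mean) ** 2 for t in truth) / len(truth))
    return rmse(truth, estimate) / std


def r_squared(truth, estimate):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean = sum(truth) / len(truth)
    ss_res = sum((t - e) ** 2 for t, e in zip(truth, estimate))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot
```

Under these assumed definitions the two metrics are tied together by R² = 1 - NRMSE². Reported values need not satisfy this identity if a different definition, such as a squared correlation coefficient, was used.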


4 Results and Analysis 


4.1 Optimization of Feature Combinations 


Table 2 shows the $R^2$ and NRMSE values. Overall, the SSH anomaly is the leading
factor affecting OHC, followed by the SST anomaly. This can be seen from Case 1,
which already achieves a high retrieval accuracy with only these two features
(Table 2). Case 1A and Case 1R show that the accuracy is fairly good, with the
retrieved OHC explaining about 70% of the variance. The retrieval accuracy with
satellite data is slightly higher. This could be because Argo’s SSH is actually the
dynamic height integrated from temperature and salinity [44], so the contribution
from volume changes is missing. Consistent with our previous work [36], comparing
Case 1 and Case 2 makes it clear that including spatiotemporal information, i.e.,
LON, LAT, and DOY, enhances the training substantially for both data sets. We suspect
that including DOY improves the NN because it allows the network to learn the
seasonal cycle, which is the most dominant signal in OHC. By comparing Case 2 with
Case 4, or Case 1 with Case 3, we find that SSW only improves the accuracy by ~1%.
Comparing Case 1 with Case 5, SSS increases the retrieval accuracy in Case R but
reduces it by about 6% for the Argo data (Case A). This is because Argo SSS differs
to a large extent from the remote sensing SSS. When the measured SSS is used to
train and the remotely sensed SSS is used to predict, this discrepancy results in
considerably lower accuracy. On the other hand, Case R is generally better than
Case A. This is not surprising, given that SSH primarily represents the inner
dynamics of the first baroclinic mode [9, 40], and given the mismatch between dynamic
height and satellite-based absolute dynamic topography. Directly using remote
sensing data for the training is therefore the better way. Case 8R had the highest
accuracy, capturing about 80% of the variability; nevertheless, because SSS is only
available for recent years, the feature combination in Case 4R is chosen as the
optimized network
features.

Table 2 Design of experiments and corresponding results for testing OHC300^a

| Experiment^b | Input features | $R^2$ (Case A/Case R) | NRMSE (Case A/Case R) |
|---|---|---|---|
| Case 1A, Case 1R | SSH SST | 0.69/0.71 | 0.39/0.38 |
| Case 2A, Case 2R | SSH SST DOY LON LAT | 0.79/0.80 | 0.36/0.34 |
| Case 3A, Case 3R | SSH SST SSW | 0.70/0.72 | 0.38/0.38 |
| Case 4A, Case 4R | SSH SST SSW DOY LON LAT | 0.79/0.80 | 0.36/0.34 |
| Case 5A, Case 5R | SSH SST SSS | 0.64/0.71 | 0.40/0.39 |
| Case 6A, Case 6R | SSH SST SSS DOY LON LAT | 0.67/0.79 | 0.49/0.35 |
| Case 7A, Case 7R | SSH SST SSS SSW | 0.64/0.72 | 0.41/0.38 |
| Case 8A, Case 8R | SSH SST SSS SSW DOY LON LAT | 0.71/0.81 | 0.44/0.33 |

a These values were achieved after training with 2010 data (12 months), while the
testing was performed with January 2011 data. Note that all these features are
anomalies
b Case R indicates that SSH and SST are from remote sensing data. Case A indicates
that those are from the surface record of Argo. The underlines indicate variables
that differ between Case A and Case R. Notice that these differences are only in the
training; in the testing, all are from remote sensing products
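
The eight feature sets of Table 2 follow a regular pattern: SSH and SST are always present, while SSS, SSW, and the spatiotemporal group (DOY, LON, LAT) are switched on or off. The sketch below enumerates them in the table's order. The function and constant names are hypothetical, and the code only builds the feature lists; it does not train any network.

```python
from itertools import product

BASE = ["SSH", "SST"]
# Optional groups, ordered so that enumeration reproduces Cases 1 to 8.
OPTIONAL = [["SSS"], ["SSW"], ["DOY", "LON", "LAT"]]


def experiment_table():
    """Map case number (1-8) to its feature list; each is run as Case A and R."""
    table = {}
    for case_no, flags in enumerate(product([False, True], repeat=3), start=1):
        features = list(BASE)
        for use, group in zip(flags, OPTIONAL):
            if use:
                features += group
        table[case_no] = features
    return table
```

Running each of the eight sets once with Argo surface inputs (Case A) and once with remote sensing inputs (Case R) gives the 16 experiments.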


4.2 Deep, or Shallow—That Is the Question 


One common perspective is that a deep NN has a stronger capability to regress the
complex hidden relationship between input and output features. Theoretically,
however, the universal approximation theorem demonstrates that a one-hidden-layer NN
with sufficient neurons can approximate any continuous function [24]. To test this
concept in retrieving the OHC, the optimal hyperparameters were sought using a
grid-search method. We designed several experiments in which the NN was deepened
from two hidden layers to six. The performance of networks with different
hyperparameters is examined using a subset of data (Fig. 1).

As Fig. 1 demonstrates, as the number of neurons increases, the retrieval accuracy
of two- and three-layer networks first improves, then declines. Generally,
three-layer networks have steeper declines, i.e., they are more prone to overfitting.
This means that keeping a basic shallow network structure is better for the current
problem. We also observe that increasing the complexity of the NN reduces the linear
trends; the global and basin-wide warming trends of OHC are discussed in the
following subsection. These results agree with our previous application of NNs to
subsurface temperature estimation [36]. In summary, adding more hidden layers to a
network can improve its capacity to fit a complicated input-to-output mapping
function. It might, however, raise the probability of overfitting and make the
training more difficult.

The choice of activation functions also matters (Fig. 1). For two-layer networks,
the combination of ReLU and sigmoid functions is not as good as the sigmoid
function alone. The ReLU activation, on the other hand, outperforms all three-layer
networks. Furthermore, the ReLU activation function is expected to be more
computationally efficient than the sigmoid function, but this advantage is negligible
for shallow networks. This result shows that for a shallow NN, the nonlinear sigmoid
function is a better choice, in line with the above-mentioned universal approximation
theorem [24]. Based on these tests, the optimum NN design was determined to be a
three-layer NN with three neurons and a combined sigmoid and ReLU activation.
This architecture is used by default to report findings in the following text.

Fig. 1 Determination coefficients as a function of the number of neurons for different NN structures.
Different activation function combinations and hidden-layer numbers are represented by different
line colors. The training data are OHC300 of 2005-2013, while the testing data are those of 2017. The
configuration of Case 4 was adopted. Sig denotes the tangent sigmoid activation function


4.3 Data Reconstruction 


We used the ensemble approach to train the model using data from 2005 to 2018.
We further hindcasted the data from 1993, the earliest year with global altimetry
coverage, to the year 2020. OPEN OHC data are compared with five data sets, i.e.,
NCEI, EN4, IAP, ARMOR3D, and GLORYS2V4, with an emphasis on interannual
variability and decadal trends. These data sets are summarized in Table 1.

Figure 2 presents OHC300, OHC700, and OHC2000 from OPEN and Argo for
January 2011 (as in Table 2). The NRMSE values were 0.36, 0.34, and 0.37 for the
three depth integrals, while the retrieval $R^2$ values were 0.80, 0.82, and 0.80,
respectively. The accuracy changes little across depths, suggesting the robustness
of the NN. The spatial distribution hindcasted by OPEN closely agrees with
the Argo OHC. The spatial distribution of the OHC is dominated by the ENSO
fluctuation. In the tropical Indo-Pacific waters, the OHC has high values for all
depths; in the eastern tropical Pacific, the OHC is lower. This pattern suggests a
La Niña state, consistent with a multivariate Niño index of -1.83. In the southern
hemisphere, the meandering of the Agulhas retroflection is discernible, showing
alternating warming and cooling patterns [6]. The OHC patterns are consistent across
depths; however, the magnitude gradually increases with depth. The largest
difference is between OHC300 and OHC2000. Significant OHC changes can be found in
the Pacific and Indian Oceans for OHC300, while changes can be found in all basins
for OHC2000.

Fig. 2 The ocean heat content (OHC) in joules for the OPEN hindcast and Argo data in January 2011
for (top) 0-300 m; (middle) 0-700 m; and (bottom) 0-2000 m. The OHC reference period is 2005 to
2015

For the 1993-2004 hindcast, using IAP data as the true value, Fig. 3a shows the
pattern correlation of the hindcasted OPEN OHC, while the temporal correlation and
errors are displayed in Fig. 3b-d. The total $R^2$ was higher than 0.98, with an
NRMSE of 12%. The interannual fluctuations of $R^2$ and NRMSE are quite small,
indicating a consistent performance, which is preferable for long-term temporal
extrapolation. In 1997, there is an exceptionally high error, which is most likely
due to the strong ENSO signature of this year. ENSO might disturb the ocean surface,
causing the network to deviate from its learned association.

Fig. 3 Comparing OPEN OHC300 with the IAP data set. a Pattern RMSE and NRMSE shown
as time series. Spatial maps of temporal b RMSE, c determination coefficient, and d NRMSE

OPEN’s error is relatively higher at the extension of the western boundary current
systems, in two zonal bands across the subtropical Pacific Ocean at ~25° in the
northern and southern hemispheres, and in the Agulhas retroflection region. All
these systems have nonlinear circulation and complex dynamics. In other regions,
OPEN has a high correlation and low RMSE in terms of site-wise OHC time series,
although heterogeneous structures are still present. Overall, the hindcast of OHC300
in the global ocean presents ~10% error with respect to the spatiotemporal standard
deviation of OHC300.

Table 3 summarizes the statistical metrics between OPEN OHC and other products.
When compared to the Argo OHC over the training period (2005-2018), the $R^2$ values
are all 0.988 or higher (OHC300: 0.993; OHC700: 0.988; OHC1500: 0.988; OHC2000:
0.989), whereas the NRMSE values are all about 11% or less (OHC300: 0.09; OHC700:
0.109; OHC1500: 0.111; OHC2000: 0.106). Among the other products, OPEN agrees best
with EN4 and differs most from IAP (Table 3). In summary, OPEN OHC can be reliably
reconstructed for the pre-Argo period since the overall accuracy is high.

We further compare the global OHC300 from all the products shown in Fig. 4, 
and the corresponding linear trends for two distinct time periods of 1993-2010 and

Table 3 Accuracy of OPEN compared with other data sets

| Depth | Metric | Argo | IAP | EN4 | ARMOR3D | GLORYS2V4 |
|---|---|---|---|---|---|---|
| OHC300 | $R^2$ | 0.993 | 0.984 | 0.990 | 0.991 | 0.988 |
| OHC300 | RMSE ($\times 10^{19}$ J) | 0.841 | 1.237 | 0.964 | 0.907 | 1.062 |
| OHC300 | NRMSE (%) | 8.6 | 12.0 | 8.2 | 8.5 | 10.1 |
| OHC700 | $R^2$ | 0.988 | 0.971 | 0.986 | 0.987 | 0.982 |
| OHC700 | RMSE ($\times 10^{19}$ J) | 1.642 | 2.542 | 1.767 | 1.713 | 2.009 |
| OHC700 | NRMSE (%) | 10.9 | 15.8 | 9.2 | 9.6 | 11.9 |
| OHC1500 | $R^2$ | 0.988 | 0.958 | 0.985 | 0.984 | 0.975 |
| OHC1500 | RMSE ($\times 10^{19}$ J) | 2.256 | 4.111 | 2.472 | 2.563 | 3.142 |
| OHC1500 | NRMSE (%) | 11.1 | 19.1 | 9.2 | 10.8 | 13.5 |
| OHC2000 | $R^2$ | 0.989 | 0.953 | 0.985 | 0.982 | 0.972 |
| OHC2000 | RMSE ($\times 10^{19}$ J) | 2.333 | 4.755 | 2.643 | 2.957 | 3.645 |
| OHC2000 | NRMSE (%) | 10.6 | 20.3 | 8.9 | 11.2 | 14.2 |

Fig. 4 12-month moving averaged global OHC300 (unit: ZJ, i.e., $\times 10^{21}$ J) referred to the 2005-
2014 period. The thick black line with gray envelopes is the ensemble average and three standard
deviations for six OPEN ensemble members

Table 4 Warming rates^a of the global ocean OHC at different depths (unit: $\times 10^{22}$ J/decade)

| Depth (m) | EN4 | GLORYS2V4 | ARMOR3D | IAP | NCEI | OPEN |
|---|---|---|---|---|---|---|
| 0-300 | 4.86/3.93^b | 6.95/6.12 | 7.45/5.68 | 4.62/4.18 | 3.78/3.83 | 3.71/3.95 |
| 0-700 | 7.86/7.11 | 12.41/9.88 | 13.15/8.71 | 6.93/6.68 | 5.81/5.70 | 7.92/8.16 |
| 0-1500 | 10.46/10.31 | 15.55/13.55 | 18.31/12.92 | 9.19/9.07 | -^c | 9.78/10.63 |
| 0-2000 | 10.98/11.20 | 16.05/14.61 | 18.83/13.41 | 9.79/9.77 | - | 10.10/11.21 |

a These trends were computed from the 12-month moving mean of the global integration of each
product
b The two values are for the 1993-2010 and 1998-2015 periods
c NCEI (Levitus) data are only available for the upper 700 m


1998-2015. The surface warming hiatus occurred in the second period. The trends
reflect the ongoing warming of the global ocean. For instance, record-high OHC
values were reached in 2018 and 2019 [15, 16]. Interannual variability such as ENSO
leaves its fingerprint on the OHC, showing an abrupt high in 1997 and 1998. This
signal is more visible in the upper-ocean OHC300 and less so in the deeper OHC
(0-700, 0-1500, and 0-2000 m), because OHC300 is more sensitive to the surface
thermal forcing. In Table 4, it is noticeable that OPEN has a higher OHC warming
trend than IAP, while that of EN4 is very close to the latter. Since these two data
sets were both produced by mapping techniques from a similar database of in-situ
observations, this similarity is not surprising, especially for OHC300 and less so
for the deeper OHC. At these depths, the two statistic-based products (GLORYS2V4 and
ARMOR3D) present even larger inconsistencies and stronger trends. Summarizing across
all the products in Fig. 4, our NN-based OPEN product agrees well with other
products, falling within the range of all data sets. For both OPEN and ARMOR3D, the
OHC300 presents a high bias after the year 2015, which is likely because both use
the same source of remote sensing data as the major inputs of the estimation.
Further improvements may be achieved with more sophisticated AI approaches; ongoing
efforts aim to predict OHC using time sequence learning and spatial
autoencoding-decoding structures. Given the large uncertainties among the various
estimates and the core role of OHC in understanding ocean warming and heat transfers
in the Earth system, the importance of accurately quantifying the OHC is further
emphasized.
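
The decadal trends discussed here are linear fits to 12-month moving means of the integrated OHC. A minimal sketch of that calculation is given below. It assumes an ordinary least-squares fit and a running mean that only uses full windows; the function names are illustrative, and the actual fitting procedure of the chapter may differ.

```python
def moving_average(values, window=12):
    """Running mean over full windows only (one output per complete window)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def linear_trend(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var


def trend_per_decade(monthly_values, window=12):
    """Slope of the smoothed monthly series, expressed per decade."""
    smoothed = moving_average(monthly_values, window)
    years = [i / 12.0 for i in range(len(smoothed))]
    return 10.0 * linear_trend(years, smoothed)
```

Applied to a monthly global OHC series in units of $10^{22}$ J, `trend_per_decade` would return a value directly comparable to one entry of Table 4.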

For different ocean basins, the OHC variability and trends are shown in Fig. 5
and Table 5. The representative patterns of linear trends of IAP and OPEN are shown
in Fig. 6. All the major ocean basins show consistent warming trends, which reflect
the overall ocean warming caused by anthropogenic forcing. Consistent with the
findings of [33], the highest OHC increase in both time periods is found in the
Indo-Pacific basins and the warm pool area. This highlights the role of the
Indo-Pacific in driving the recent global warming hiatus. In contrast, the Southern
Ocean shows the lowest warming rates; this estimate is accompanied by large
uncertainties, which can be attributed to the low Argo coverage there. From 1993 to
2010, IAP and OPEN both present a basin-wide dipole pattern of warming and cooling
in the Pacific Ocean, with positive trends in the western part and negative trends
in the east (Fig. 6). The structure resembles the negative pattern of the Pacific
Decadal Oscillation, which is supported by the transition from the positive to the
negative phase reported in the literature [37]. For the later period, both IAP and
OPEN show bulk warming in the Indian Ocean, but less homogeneous patterns in the
other basins. These consistencies demonstrate the capability of OPEN data to reflect
OHC trends in both the Argo and pre-Argo eras.

Fig. 5 12-month moving averaged OHC300 (unit: ZJ, i.e., $\times 10^{21}$ J) for four major ocean basins.
Because Argo data have a shorter temporal coverage, the reference period here is 2005-2014. The thick
black line with gray envelopes is the ensemble average and three standard deviations for six OPEN
ensemble members

Table 5 The OHC linear trends^a for different ocean basins (unit: $\times 10^{22}$ J/decade)

| Basin | EN4 | GLORYS2V4 | ARMOR3D | IAP | OPEN |
|---|---|---|---|---|---|
| Atlantic Ocean | 1.27/0.62^b | 2.24/1.25 | 1.93/0.88 | 1.27/0.66 | 1.47/1.28 |
| Pacific Ocean | 2.29/1.33 | 3.52/3.12 | 3.49/1.93 | 1.88/1.57 | 2.57/2.45 |
| Indian Ocean | 0.80/1.36 | 1.26/1.75 | 1.43/1.50 | 0.95/1.48 | 1.17/1.36 |
| Southern Ocean | -0.15/0.12 | -0.25/-0.15 | 0.16/0.36 | 0.01/0.20 | 0.14/0.30 |

a These trends were computed from the 12-month moving mean of the regional integration of each
product
b The two values are for the 1993-2010 and 1998-2015 periods

We now focus on the inconsistencies. Compared with the majority of data sets,
OPEN differs to the largest degree in the Pacific, because the ENSO signature is
more pronounced in the Pacific and weaker in the other oceans. For OPEN, an
unexpected, significant jump after 2015 can be seen in the Pacific OHC, which is not
found in the global OHC (Fig. 4). Further inspection suggests that the different
references contributed approximately half of this jump, i.e., OPEN has a lower
reference than ARMOR3D and GLORYS2V4. On the other hand, OPEN OHC300 presents a
minimal envelope of uncertainty in the Pacific basin (Fig. 5), which implies that
the jump is not due to random errors but mostly reflects systematic biases. To
reduce the error, one option is to use a more sophisticated, deeper NN, such as the
deep convolutional NN discussed in the following chapters of this book, to extract
the complicated link between surface variables and OHC. Alternatively, one can use
a clustering technique to divide the global ocean into distinct thermal provinces,
each of which can be represented by a simple but different surface-subsurface
relationship and thus better estimated by an NN. This strategy has been shown to be
viable in our previous effort [36]. These options will be tested in future research.

Fig. 6 Linear trends for (left) IAP and (right) OPEN OHC300. The upper row shows those of
1993-2010 and the lower row those of 1998-2015. Zeros are depicted with black lines

We notice that IAP has a reference that is about 10% higher than that of Argo
and OPEN. To further examine this mismatch, we show the non-anomaly OHC300
from these three products for one particular snapshot as an example (Fig. 7).
Although the three products show relatively similar patterns, the IAP values are
significantly larger than those of OPEN and Argo, i.e., IAP has a warm bias, which
is especially noticeable in the subtropical gyres of the Pacific Ocean. This is very
likely due to errors in the XBT correction scheme of IAP. Compared with the more
accurate CTD measurements, XBT measurements have a well-documented warm bias,
despite the advanced time-varying correction recently developed [12, 34], which
accounts for a warm bias of ~0.5°C. Yet, because this error is systematic, it leaves
limited signatures in the decadal trends. Nevertheless, this highlights the value of
having additional independent OHC datasets.

Fig. 7 Non-anomaly OHC300 (unit: J) for a Argo, b IAP, and c OPEN, shown as a
snapshot of the April 2018 average

Fig. 8 The PCs (left), EOFs of OPEN (middle), and EOFs of IAP (right) from the EOF analysis
of OHC300. The explained percentage of each mode is listed above the corresponding EOF.
Before the EOF analysis, linear trends and high-frequency signals (periods shorter than
12 months) were removed

To further demonstrate the spatiotemporal variability of OHC300, we apply
empirical orthogonal function (EOF) analysis to OPEN and IAP (Fig. 8). The EOF
analysis was conducted after linear trends and seasonal variation (using a 12-month
lowpass filter) were removed. The spatial EOFs and temporal PCs show high
agreement between the OPEN and IAP data sets (Fig. 8). For instance, the first mode
accounts for 41.2% of the variability in IAP and 40.7% in OPEN, which are very close
to each other. Only small differences can be found in the corresponding PC1
(Fig. 8a). Both first EOFs show a tropical Pacific dipole, higher in the warm pool
and lower in the eastern part. In the extratropics, the difference is also small.
Other modes present visibly larger differences, but considering the smaller
percentages (<12%) of these modes, their contribution to the OHC difference is
small. The same conclusions can be drawn by comparing OPEN with the other products,
further demonstrating the validity of OPEN.
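
For reference, a common way to obtain EOFs and PCs is the singular value decomposition of the anomaly matrix. The formulation below is an assumption, since the chapter does not spell out its implementation: $X$ denotes the detrended, low-pass-filtered OHC300 anomalies arranged as time by space, and any area weighting is omitted.

$$X = U \Sigma V^{\mathsf{T}}, \qquad \mathrm{EOF}_k = v_k, \qquad \mathrm{PC}_k(t) = \sigma_k\, u_k(t), \qquad \text{explained fraction}_k = \frac{\sigma_k^2}{\sum_j \sigma_j^2}$$

Here $u_k$ and $v_k$ are the $k$-th columns of $U$ and $V$, and $\sigma_k$ is the $k$-th singular value. Under this convention, the percentages quoted for the first mode (41.2% for IAP and 40.7% for OPEN) would correspond to the explained fraction for $k = 1$.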


5 Summary and Conclusions 


In this chapter, we describe the AI technique of DORS and its application to
studying climate change. An NN approach was developed to estimate OHC from
remote sensing data sets, yielding a new ocean heat content estimate termed the
Ocean Projection and Extension neural Network (OPEN) product [47]. Using the
1° x 1° gridded Argo OHC data as the true values, and taking advantage of remote
sensing products of SSH, SST, and SSW with near-global coverage and higher
spatiotemporal resolution, we trained four NNs, one each for estimating the OHC
from the surface to 300, 700, 1500, and 2000 m depth. The NNs were trained with
the 2005-2018 data in a way that enables the temporal extrapolation of OHC. The NN
was optimized by testing a variety of NN architectures and feature combinations.
Generally, a simple shallow NN was favorable for temporal extrapolation. The
final choice of NN architecture had three hidden layers, each with three neurons.
In this way, the four-depth OPEN OHC product was extended back to 1993, covering
the pre-Argo era, with a very high accuracy of $R^2$ > 0.95 and NRMSE < 20%. We
also estimated the uncertainty of OHC using an ensemble technique, which
demonstrated that the uncertainty arising from the NN technique is low. Comparisons
of OPEN against other widely applied OHC data sets showed the good performance of
OPEN in terms of trends and variability.

Various contributions have emphasized the need for more trustworthy OHC products
for the sake of understanding the Earth’s climate, e.g., [42, 54]. As mentioned
before, all estimates are subject to different sources of uncertainty. In-situ
mapping-based products (IAP, EN4, and NCEI) have inconsistent observation
records and uncertainties in their mapping schemes. Numerical models (GLORYS2V4)
may incorporate imperfect representations of physics. Despite the favorable
performance of the NN-based OPEN product, it has limitations. It was trained on
gridded Argo data. Although the gridded Argo product is often treated as
observation, it is subject to its own mapping and instrumental errors. For instance,
[30] found larger errors of such products in western boundary current systems, which
are characterized by nonlinear dynamics. The uneven distribution of Argo profiles
also contributes to the spatial errors.

For the oceanography community, one lingering doubt about AI techniques is:
what can these techniques do to solve real-world oceanography problems? To date,
many applications of AI oceanography are still very preliminary, ‘in their infancy’
[55], and far from product-level outcomes. This chapter shows a promising
application of AI techniques in climatic and ocean sciences, in addition to the currently
available DORS studies. Presumably, the application of OPEN will also serve as
a base for future AI studies. Several future directions of AI application
concerning OHC and its climatic effects are: (1) extending the global OHC product to a
longer time span, preferably covering several rapid-warming and, in particular, surface
warming ‘hiatus’ periods to understand the phenomenology [53]; (2) generating a
downscaled OHC product with higher spatial/temporal resolution; (3) developing AI
methods that dig into multiple datasets and digest physical laws; and (4)
projecting future OHC. These are all playgrounds where the AI Oceanography approach
can unleash its potential.
1 Introduction


Tropical cyclones (TCs), among the most violent phenomena of air-sea
interaction, often bring disastrous storm surges and flooding and cause significant
damage to human life, agriculture, forestry, fisheries, and infrastructure. Therefore,
knowledge of TC track, intensity, structure, and evolution is required to guide severe
weather forecasting and risk assessment. Generally, the formation of TCs requires
both dynamically and thermodynamically favorable environmental conditions [19].
Because only a small percentage of convective disturbances develop into TCs, it is
still challenging to predict TC formation accurately.

Since the Dvorak Technique (DT) was proposed and developed [8, 21, 28], it
has been widely used in TC intensity estimation [14, 22, 25, 29] and TC formation
prediction [6, 20, 32, 33]. However, DT is based on infrared imagery, in which the
underlying features may be obscured by significant convection or cirrus clouds. In contrast,
microwave radiation images can capture the strong convective areas and cloud
organization. Therefore, microwave remote sensing data have the potential to predict
the formation of TCs.

With the advancements in high-performance computing, machine learning
methods based on big datasets are widely used in tropical cyclogenesis detection.
Based on the decision tree method, a series of classification rules were
constructed to predict future TC genesis events, with an overall prediction
accuracy of 81.72% [35]. Using a dataset established with WindSat wind
products, [23] built a classification model for tropical cyclogenesis detection.
The validation shows that the model produced a positive detection rate of
approximately 95.3% and a false alarm rate of 28.5%. This study confirmed the
potential of microwave remote sensing observations in detecting typhoon formation [23].
Recently, based on the internal structure information of tropical cyclones obtained
by satellite remote sensing, [27] used machine learning to improve the prediction
accuracy of TC rapid intensification and reduce the false alarm rate. Moreover, [13]
compared the TC formation detection performance of different machine learning
algorithms. Their results show that the machine learning methods perform better
than the traditional linear discriminant analysis.

However, with the continuous accumulation of remote sensing data, traditional
machine learning methods struggle to handle massive datasets. Fortunately,
deep learning has demonstrated clear advantages over traditional
physical or statistical algorithms for image information extraction [18]. In
ocean remote sensing applications, deep learning methods are used in hurricane
intensity estimation [7, 24], sea ice concentration prediction [5, 9, 11], sea
surface temperature estimation [1, 30], and other fields [10, 26, 37]. A deep learning
approach has been proposed to identify TCs and their precursors
based on twenty years of simulated outgoing longwave radiation (OLR) calculated with
a cloud-resolving global atmospheric simulation [19]. In the Northwest Pacific
from July to November, the probability of detection (POD) of the model is
79.9-89.1%, and the false alarm ratio (FAR) is 32.8-53.4%. In addition, this study
reveals that the detection performance is correlated with the amount of training data
and TC lifetimes.

Although deep learning is increasingly used in ocean remote sensing [18],
its disadvantages are also evident. It requires high computing power
and a long training time. Moreover, most deep learning models lack incremental
learning capacity, which means the model needs to be retrained whenever
the dataset is updated. In ocean remote sensing, satellite-based data increase every day,
so datasets are expected to keep expanding to improve the generalization
ability and identification accuracy of models. The lack of incremental
learning therefore costs both storage resources and model update time. Fortunately, the
incremental learning capacity of the Broad Learning System (BLS) [3] gives it
the potential to be applied in ocean remote sensing. Meanwhile,
the BLS is a time-cost-friendly learning strategy due to its flat network. These
advantages can compensate for its lower accuracy compared with deep
learning, so it has been widely used since it was proposed. Recently, it has been
successfully applied in seismic attenuation modeling [16], model updating [17],
hyperspectral imagery classification [34], and crack detection [36].

In this chapter, we propose a tropical cyclogenesis detection algorithm based
on Special Sensor Microwave Imager (SSM/I) brightness temperature data. The
proposed BLS-based model has three distinctive features: low hardware requirements,
fast computation speed, and incremental learning ability. In Sect. 2, the dataset used in
this study is presented. In Sect. 3, the details of BLS are introduced. The experimental
results are shown in Sect. 4, and the conclusion is given in Sect. 5.


2 Data Description 


The dataset used in this chapter is extracted from the brightness temperature (TB)
observations acquired by SSM/I. This series of instruments is carried onboard
Defense Meteorological Satellite Program (DMSP) near-polar orbiting satellites. The
SSM/I is a conically scanning sensor that measures the natural microwave emission
from the Earth in the spectral band from 19 GHz to 85 GHz with different polarizations
(see Table 1). The parameters derived from these radiometer observations include
surface wind speed, atmospheric water vapor, cloud liquid water, and rain rate [31].
After comparing the TB images in different channels/polarizations, the 37 GHz
H-polarization (37H) channel is selected because it clearly depicts the features
of disturbances and tropical cyclones.

To collect the sample images covering TCs or non-developing disturbances
(non-TC), the TC best tracks and tropical cloud cluster (TCC) tracks during 2005-2009 are
used as auxiliary data. This information can be obtained from the International Best
Track Archive for Climate Stewardship (IBTrACS) dataset [15] and the Global Tropical
Cloud Cluster dataset [12], respectively. The time resolution of these two datasets is
three hours. Note that not all the best track records in the TC evolution period are
used; only those during the TC formation period are selected. Specifically, the time
when the TC maximum wind speed reaches 25 knots for the first time is defined
as the starting time. Then, the 72 h after this time is defined as the TC formation
period [23]. For the TCC tracks, only the records that have not developed into TCs
are selected. The preprocessing steps for extracting the TC and non-TC images are
described as follows:


(1) For each TC/non-TC track record, determine the matching SSM/I TB data within 
the absolute time difference of 1.5h. 

(2) Take the track record as the image center position, and extract the sub-images 
with the size of 8° x 8° from the SSM/I TB observations. 

(3) The sub-images with more than 60% non-empty pixels are retained as qualified 
samples (see Fig. 1); otherwise, the invalid data will be excluded (see Fig. 2). 
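
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' processing code: it assumes that the TB field has already been placed on a regular latitude/longitude grid, that empty pixels are stored as NaN, and that longitude wrap-around can be ignored. The function names and the grid layout are hypothetical.

```python
from datetime import datetime, timedelta
import math

MAX_TIME_DIFF = timedelta(hours=1.5)   # step (1): matching window
BOX_SIZE_DEG = 8.0                     # step (2): 8 x 8 degree sub-image
MIN_VALID_FRACTION = 0.60              # step (3): share of non-empty pixels


def matches_in_time(track_time, swath_time):
    """Step (1): True if the swath lies within 1.5 h of the track record."""
    return abs(track_time - swath_time) <= MAX_TIME_DIFF


def extract_box(tb, lats, lons, center_lat, center_lon):
    """Step (2): cut the box centred on the track position.

    tb is a 2-D list indexed [lat][lon]; lats and lons are the grid axes
    (an assumed layout, chosen only for this sketch).
    """
    half = BOX_SIZE_DEG / 2.0
    rows = [i for i, lat in enumerate(lats) if abs(lat - center_lat) <= half]
    cols = [j for j, lon in enumerate(lons) if abs(lon - center_lon) <= half]
    return [[tb[i][j] for j in cols] for i in rows]


def valid_fraction(box):
    """Fraction of pixels that are not NaN (NaN marks an empty pixel)."""
    values = [v for row in box for v in row]
    if not values:
        return 0.0
    return sum(not math.isnan(v) for v in values) / len(values)


def is_qualified(box):
    """Step (3): keep the sample only if more than 60% of pixels are valid."""
    return valid_fraction(box) > MIN_VALID_FRACTION
```

For example, the 00:00 UTC 17 October best track record of Wilma and the SSM/I overpass at 01:22 UTC are 1 h 22 min apart, so `matches_in_time` accepts the pair.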


Table 1 Channel characteristics of SSM/I 


Band (GHz)    Polarization    Spatial resolution (km x km)
19.35         V/H             69 x 43
22.235        V               50 x 40
37.0          V/H             37 x 28
85.5          V/H             15 x 13


* H: horizontal polarization, V: vertical polarization
Fig. 1 Qualified samples: a valid TC samples and b valid non-TC samples
Fig. 2 Unqualified samples: a invalid TC samples and b invalid non-TC samples
Following the above steps, 880 TC samples and 6268 non-TC samples were
obtained from the SSM/I observations in 2005-2009. Because of the large imbalance
between the two classes, only the 2506 non-TC samples in 2005-2006 and
the 880 TC samples in 2005-2009 were selected to form the final dataset. Each sample
has a size of 224 x 224 pixels with RGB channels. Finally, the TC and non-TC sets
are each randomly divided into a training set and a testing set in the ratio of 4:1.
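
A class-wise 4:1 split of this kind can be sketched as follows. The sample identifiers are placeholders and the seeds are arbitrary. The chapter does not state how fractional counts are rounded, so the exact set sizes produced here are an assumption of this sketch rather than the authors' numbers.

```python
import random


def split_class(samples, train_fraction=0.8, seed=0):
    """Shuffle one class and split it into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers; the real samples are 224 x 224 TB images.
tc_ids = [("TC", k) for k in range(880)]
non_tc_ids = [("non-TC", k) for k in range(2506)]

# Each class is split separately so that both sets keep the class ratio.
tc_train, tc_test = split_class(tc_ids, seed=1)
non_tc_train, non_tc_test = split_class(non_tc_ids, seed=2)

train_set = tc_train + non_tc_train
test_set = tc_test + non_tc_test
```

Splitting each class on its own keeps the TC to non-TC proportion the same in the training and testing sets, which matters here because the classes are imbalanced.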


3 Broad Learning System for Tropical Cyclogenesis 
Detection 


Once the dataset is established, tropical cyclogenesis detection can be executed
with the broad learning system. In contrast to deep learning methods, the BLS
provides a time-cost-friendly learning strategy due to its flat network. The main
structure of BLS consists of the input layer, node layer, and output layer.
Specifically, the node layer includes the feature nodes and enhancement nodes. Generally,
the input data are mapped to feature nodes with random weights. Then, the feature
nodes are further mapped to enhancement nodes with new random weights. Finally, the
output weights of BLS are trained by estimating the output data from these
feature nodes and enhancement nodes. Figure 3 shows the architecture used in this study,
where X is the input data, F denotes the feature nodes, E denotes the
enhancement nodes, and Y is the classification labels of the input data X. The
details of BLS are presented as follows.


3.1 Broad Learning Model 


Assume that the input data is X, so the feature nodes mapped with random weights
can be described as

$$F_i = \phi(X W_{e_i} + b_{e_i}), \quad i = 1, \ldots, n \tag{1}$$


where $F_i$ is the i-th feature node, and $W_{e_i}$ and $b_{e_i}$ are the random weights and biases
with the proper dimensions, respectively. Denote $F^n = [F_1, \ldots, F_n]$, which is the
concatenation of all the first n groups of mapped features. Then, the enhancement
nodes can be given by:

$$E_m = \xi(F^n W_{h_m} + b_{h_m}) \tag{2}$$


where $E_m$ is the m-th enhancement node, and $W_{h_m}$ and $b_{h_m}$ are the random weights and
biases with the proper dimensions, respectively. Similarly, the concatenation of all the
first m groups of enhancement nodes is denoted as $E^m = [E_1, \ldots, E_m]$. Therefore,
the broad model can be represented as the equation of the form
Fig. 3 The architecture of BLS


$$
\begin{aligned}
Y &= [F_1, \ldots, F_n \mid \xi(F^n W_{h_1} + b_{h_1}), \ldots, \xi(F^n W_{h_m} + b_{h_m})] \, W^m \\
  &= [F_1, \ldots, F_n \mid E_1, \ldots, E_m] \, W^m
\end{aligned}
\tag{3}
$$


where $W^m = [F^n \mid E^m]^+ Y$ are the connecting weights for the broad structure to
be computed and $[F^n \mid E^m]^+$ is the pseudo-inverse of $[F^n \mid E^m]$. In a flat
network, the pseudo-inverse is a very convenient approach to solving the
output-layer weights of a neural network. However, a straightforward solution is too
expensive, especially when the training samples and input patterns suffer from high
volume, high velocity, and/or high variety [4]. Because directly computing the
pseudo-inverse is expensive, the solution can be approximated by
ridge regression:


$$A^+ = [F^n \mid E^m]^+ = \left(\lambda I + [F^n \mid E^m]^T [F^n \mid E^m]\right)^{-1} [F^n \mid E^m]^T \tag{4}$$

where $\lambda$ is the regularization parameter. Finally, the model weights are given by


$$W = A^+ Y \tag{5}$$

154 S. Wang and X. Yang 


During the BLS computation, it should be noted that the number of enhancement
nodes is a hyperparameter (N3), and the number of feature nodes is the product
of two hyperparameters: the number of feature windows (N1) and the number of
nodes in each feature window (N2). Here, the Bayesian optimization method is used
to find the optimal model hyperparameters, and it can be easily executed with the
Hyperopt package [2].
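
Equations (1)-(5) can be illustrated with a compact NumPy sketch. It is a minimal sketch under stated assumptions, not the authors' implementation: the feature mapping $\phi$ is taken to be linear, the enhancement mapping $\xi$ is taken to be tanh, the random weights are standard normal, and the labels Y are one-hot encoded. The function names and the default `lam` are hypothetical.

```python
import numpy as np


def build_nodes(X, params):
    """Map inputs to the node layer A = [F | E] (Eqs. 1-3)."""
    # Eq. (1): one block of N2 feature nodes per feature window (phi = identity).
    F = np.hstack([X @ W + b for W, b in zip(params["We"], params["be"])])
    # Eq. (2): N3 enhancement nodes computed from all feature nodes (xi = tanh).
    E = np.tanh(F @ params["Wh"] + params["bh"])
    return np.hstack([F, E])


def fit_bls(X, Y, n1, n2, n3, lam=1e-3, seed=0):
    """Draw the random weights and solve the output weights by ridge regression."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    params = {
        "We": [rng.standard_normal((d, n2)) for _ in range(n1)],
        "be": [rng.standard_normal(n2) for _ in range(n1)],
        "Wh": rng.standard_normal((n1 * n2, n3)),
        "bh": rng.standard_normal(n3),
    }
    A = build_nodes(X, params)
    # Eqs. (4)-(5): W = (lam * I + A^T A)^{-1} A^T Y
    params["W"] = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y)
    return params


def predict_bls(X, params):
    """Return the index of the largest output for every input row."""
    return np.argmax(build_nodes(X, params) @ params["W"], axis=1)
```

Only the output weights `W` are fitted; the random weights stay fixed after they are drawn, which is why training reduces to one linear solve. The node layer has N1 x N2 + N3 columns, so the linear system grows with the three hyperparameters.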


3.2 Incremental Learning of BLS 


When a deep learning model does not perform well, the number of convolutional kernels
or convolutional layers is usually increased. This leads to expensive
computation and long training time. In contrast, BLS uses incremental learning
to address the low model accuracy caused by insufficient mapping nodes. Generally,
incremental learning can improve the model performance. There are two
ways to expand the broad structure: (1) adding feature nodes and enhancement
nodes, and (2) adding input data.


3.2.1 Increment of the Feature Nodes and Enhancement Nodes 


Assume that the initial BLS has n feature nodes and m enhancement nodes, that is,
$A = [F^n \mid E^m]$. In the adding process, the (n+1)-th feature node is given by:


$$F_{n+1} = \phi(X W_{e_{n+1}} + b_{e_{n+1}}) \tag{6}$$


The enhancement nodes corresponding to this feature node are given by:


$$E_{ex_m} = [\xi(F_{n+1} W_{ex_1} + b_{ex_1}), \ldots, \xi(F_{n+1} W_{ex_m} + b_{ex_m})] \tag{7}$$


Then, there are additional p enhancement nodes added to the BLS structure, and
the (m+1)-th enhancement node is given by


$$E_{m+1} = \xi(F^n W_{h_{m+1}} + b_{h_{m+1}}) \tag{8}$$

Therefore, the final node layer matrix is combined as

$$A' = [A \mid F_{n+1} \mid E_{ex_m} \mid E_{m+1}] \tag{9}$$

Then, the pseudo-inverse of $A'$ is computed with


$$(A')^+ = \begin{bmatrix} A^+ - D B^T \\ B^T \end{bmatrix} \tag{10}$$


where $D = A^+ [F_{n+1} \mid E_{ex_m} \mid E_{m+1}]$,


$$B^T = \begin{cases} C^+ & \text{if } C \neq 0 \\ (I + D^T D)^{-1} D^T A^+ & \text{if } C = 0 \end{cases} \tag{11}$$

and $C = [F_{n+1} \mid E_{ex_m} \mid E_{m+1}] - A D$. Finally, the new weights are

$$W' = \begin{bmatrix} W - D B^T Y \\ B^T Y \end{bmatrix} \tag{12}$$


As seen in Eq. 12, the updated weights consist of the initial part and the new part.
There is no need to recalculate the pseudo-inverse for all nodes; only the part
associated with the added nodes is computed.
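
The node-increment update can be checked numerically with the following NumPy sketch. It treats all newly added feature and enhancement nodes as one block of columns H appended to A, and it uses the exact pseudo-inverse rather than the ridge approximation. The function name and tolerance are hypothetical. The branch for C different from zero additionally assumes that C has full column rank, which holds for generic random nodes when there are more samples than nodes.

```python
import numpy as np


def add_columns(A, A_pinv, W, H, Y, tol=1e-10):
    """Update the pseudo-inverse and weights after appending node columns H to A."""
    D = A_pinv @ H                      # projection coefficients of H onto A
    C = H - A @ D                       # part of H outside the column space of A
    if np.linalg.norm(C) > tol * max(1.0, np.linalg.norm(H)):
        B_T = np.linalg.pinv(C)         # C != 0 (assumed full column rank)
    else:
        # C = 0: the new columns are linear combinations of the old ones.
        B_T = np.linalg.solve(np.eye(D.shape[1]) + D.T @ D, D.T @ A_pinv)
    A_new = np.hstack([A, H])
    A_new_pinv = np.vstack([A_pinv - D @ B_T, B_T])
    W_new = np.vstack([W - D @ (B_T @ Y), B_T @ Y])
    return A_new, A_new_pinv, W_new
```

In this sketch the only new decomposition is the pseudo-inverse of C, which has as many columns as there are added nodes; everything else is matrix multiplication with quantities that are already stored.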


3.2.2 Increment of the Input Data 


The increment of feature nodes and enhancement nodes mentioned above is for
a fixed dataset. However, in most learning models, the input dataset is the core
factor influencing the prediction accuracy. For deep learning, once new
data are added to the training dataset, the existing model needs to be retrained. This
is time-consuming and reduces the timeliness of model updating and application.
Fortunately, there is no need to retrain the whole BLS model after adding input
data. The BLS trains only on the added samples.

Denote $X_a$ as the new inputs; the respective increments of mapped feature nodes
and enhancement nodes are:


$$F_x^n = [\phi(X_a W_{e_1} + b_{e_1}), \ldots, \phi(X_a W_{e_n} + b_{e_n})] \tag{13}$$


$$E_x^m = [\xi(F_x^n W_{h_1} + b_{h_1}), \ldots, \xi(F_x^n W_{h_m} + b_{h_m})] \tag{14}$$

where the $W_{e_i}$, $W_{h_j}$ and $b_{e_i}$, $b_{h_j}$ are those randomly generated when the initial BLS
was built. Hence, the updating matrix is

$$A' = \begin{bmatrix} A \\ A_x^T \end{bmatrix} \tag{15}$$


where $A_x^T = [F_x^n \mid E_x^m]$. The associated updating pseudo-inverse could be deduced
as follows:


$$W' = W + B \, (Y_a - A_x^T W) \tag{16}$$

where $Y_a$ are the respective labels of the additional $X_a$, and $B$ is the block of the
updated pseudo-inverse $(A')^+$ that multiplies the new labels $Y_a$. Similar to the two increment
processes mentioned above, only the pseudo-inverse associated with the new inputs is
calculated. This greatly improves the update speed of the BLS model.
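
The input-increment update can be sketched in the same way. Here `A_x` denotes the new rows of the node matrix (written $A_x^T$ in the text), and the block B is obtained from the row-append form of the pseudo-inverse update. The function name and tolerance are hypothetical. When A has more rows than columns and full column rank, C vanishes and the second branch is used; the first branch assumes that C has full row rank.

```python
import numpy as np


def add_rows(A, A_pinv, W, A_x, Y_a, tol=1e-10):
    """Update the pseudo-inverse and weights after appending sample rows A_x to A."""
    D_T = A_x @ A_pinv                  # shape (new samples, old samples)
    C = A_x - D_T @ A                   # part of A_x outside the row space of A
    if np.linalg.norm(C) > tol * max(1.0, np.linalg.norm(A_x)):
        B = np.linalg.pinv(C)           # C != 0 (assumed full row rank)
    else:
        # C = 0: B = A^+ D (I + D^T D)^{-1}, computed through a linear solve.
        G = np.eye(D_T.shape[0]) + D_T @ D_T.T
        B = np.linalg.solve(G, D_T @ A_pinv.T).T
    A_new = np.vstack([A, A_x])
    A_new_pinv = np.hstack([A_pinv - B @ D_T, B])
    W_new = W + B @ (Y_a - A_x @ W)     # correction driven by the new residual
    return A_new, A_new_pinv, W_new
```

The correction term is proportional to the residual of the old model on the new samples, so the weights are unchanged when the existing model already reproduces the new labels.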


4 Results 


Before training the model, one should determine the hyperparameters first. For the 
basic BLS, the hyperparameters include the number of feature windows (N1), the 
number of nodes in each feature window (N2), and the number of enhancement nodes 
(N3). To compare the training time and model accuracy, the classical ResNet50 model 
is also applied to the same dataset. For this ResNet50 model, the initial learning rate 
is 10°, the multiplicative factor of learning rate decay is 0.5, the batch size is 16, 
and the number of epochs is 20. Before training the ResNet50 network, all input 
images were resized to 224 x 224. To fairly compare the training time of these two 
networks, the training tasks were operated on a computer with an Intel(R) Core(TM) 
i7-8700K CPU @ 3.70GHz and 64GB RAM. 


4.1 Basic BLS Results 


Using Hyperopt, the hyperparameters are optimized as N1 = 5, N2 = 24, and N3 =
2332. The training and testing accuracies and training time of the two methods are
shown in Table 2. The testing accuracy of BLS is 86.83%, which is slightly lower
than the 91.88% of ResNet50. On the other hand, although ResNet50 is run
with GPU acceleration, its training time of 2090.45 s is still about 35 times the 60.52 s
of BLS. Therefore, BLS has obvious advantages in computational efficiency, but it
is not as accurate as the deep learning network because it is insensitive to the image
features. Furthermore, we examined the hit rate (HR) and false alarm rate (FAR).
Table 3 lists the testing HR and FAR of BLS alongside the results of [19].
The testing HR and FAR are 81.14% and 11.18%, respectively. Compared to the HR of
79.9-89.1% and FAR of 32.8-53.4% in existing deep learning research [19], our
results are competitive. But it should be noted that the dataset used in [19]
contains 50000 TCs and 500000 non-TCs, which is significantly larger than our dataset.
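
The chapter does not spell out how HR and FAR are computed. The sketch below assumes the usual confusion-matrix definitions, with HR as the fraction of TC samples that are detected and FAR as the fraction of non-TC samples that are flagged as TC. The counts in the example are hypothetical: they are not taken from the chapter and were chosen only because they reproduce the reported percentages after rounding.

```python
def detection_scores(tp, fn, fp, tn):
    """Return hit rate, false alarm rate, and accuracy from confusion counts.

    tp: TC samples classified as TC        fn: TC samples classified as non-TC
    fp: non-TC samples classified as TC    tn: non-TC samples classified as non-TC
    """
    hit_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)       # assumed definition, see lead-in
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return hit_rate, false_alarm_rate, accuracy


# Hypothetical counts for a test set of 175 TC and 501 non-TC samples.
hr, far, acc = detection_scores(tp=142, fn=33, fp=56, tn=445)
print(f"HR = {100 * hr:.2f}%, FAR = {100 * far:.2f}%, accuracy = {100 * acc:.2f}%")
```

Under this assumption the FAR is normalised by the number of non-TC samples. A false alarm ratio, which is normalised by the number of TC predictions, would give a different value for the same counts, so FAR values should be compared across studies only when the definitions agree.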


Table 2 The results for BLS and ResNet50 


Model    | Training accuracy (%) | Testing accuracy (%) | Training time (s)
BLS      | 99.96                 | 86.83                | 60.52 (CPU)
ResNet50 | 98.30                 | 91.88                | 2090.45 (GPU)


Detecting Tropical Cyclogenesis Using Broad Learning System ... 157 


Table 3 HR and FAR of BLS and results of [19] 


Metric | BLS (%) | Matsuoka et al. [19] (%)
HR     | 81.14   | 79.9-89.1
FAR    | 11.18   | 32.8-53.4


Table 4 Prediction accuracy with different hyperparameters 


Feature nodes (N1, N2) | Enhancement nodes (N3) | Training time (s) | Testing accuracy (%) | HR (%) | FAR (%)
5, 24                  | 2332                   | 58.68             | 86.83                | 81.14  | 11.18
5, 24                  | 1000                   | 57.59             | 86.24                | 72.57  | 8.98
5, 24                  | 1500                   | 59.78             | 85.35                | 75.43  | 11.18
5, 24                  | 2000                   | 58.28             | 84.62                | 74.29  | 11.78
5, 24                  | 2500                   | 58.82             | 85.50                | 74.88  | 10.78
5, 24                  | 3000                   | 59.68             | 85.50                | 74.86  | 10.78
6, 24                  | 3000                   | 68.01             | 84.47                | 77.71  | 13.17
7, 24                  | 3000                   | 73.34             | 86.24                | 78.28  | 10.98
8, 24                  | 3000                   | 77.02             | 85.50                | 78.86  | 12.18


In contrast to the massive number of hyperparameters in deep networks, the BLS has
only three primary hyperparameters that influence the prediction accuracy. To examine
how different combinations of these three hyperparameters affect the prediction
performance of BLS, we trained on the same dataset several times, and the results are
listed in Table 4. The combination of the number of feature nodes and enhancement
nodes is essential for the accuracy of the model. The table shows that the training
time tends to increase with the number of nodes, but the testing accuracy, HR, and
FAR do not change monotonically with the number of nodes.
However, the optimized hyperparameter combination has the highest HR because
the optimization process takes the HR as the selection criterion. Therefore, it is
indispensable to determine these parameters using the optimization algorithm.


4.2 Incremental Learning Results 


The capacity for incremental learning is the most prominent advantage of BLS over
most traditional deep learning models. For most deep networks, the structures are
fixed once the training process is finished. In contrast, the BLS can be updated by
adding new nodes or updating the dataset. There is no need to retrain the whole
network, which significantly reduces the time cost of model updating. The dataset
size is 3386, and we set the initial size as 1386, adding 400 input patterns at each step.
First, we trained the initial model with the initial samples. Then, the incremental
learning method was applied to add the corresponding input patterns each time until
all training samples were input. During these incremental learning steps, the
hyperparameters remain N1 = 5, N2 = 24, and N3 = 2332. The results are shown
in Table 5; note that the testing dataset is unchanged during the incremental
learning. The results in the table show that the training accuracy grows as the
number of inputs increases. For the initial process, the small size of the dataset leads
to unreasonable accuracies. Because the test dataset for each incremental learning
step is the same, the testing accuracy tends to be stable.


Table 5 Prediction accuracy and CPU time of incremental learning


Number of inputs | Training accuracy (%) | Testing accuracy (%) | Training time (s)
707              | 100.00                | 26.33                | 31.53
708-1107         | 27.46                 | 61.83                | 28.93
1108-1507        | 53.35                 | 74.11                | 28.79
1508-1907        | 63.14                 | 74.11                | 28.43
1908-2307        | 69.53                 | 74.11                | 28.56
2308-2707        | 74.03                 | 74.11                | 29.09


4.3 Case Study: Hurricane Wilma (2005) 


Once the model training/testing processes are finished, the model hyperparameters
(N1, N2, and N3) and the weights of nodes (feature nodes and enhancement nodes)
are fixed. Based on the trained BLS, the prediction task for specific TC cases can
be executed. Here, we select Hurricane Wilma (2005) as the study case to validate
the effectiveness of the proposed model. Wilma was an extremely intense hurricane
over the northwestern Caribbean Sea. It had the all-time lowest central pressure for
an Atlantic basin hurricane. According to the statistics, twenty-three deaths have
been directly attributed to Wilma, and the total economic losses reached 16 billion
to 20 billion dollars. As the best track in Table 6 shows, Wilma developed from a
tropical depression (TD) into a tropical storm (TS) at 06:00 UTC 17 October, and
then strengthened into a hurricane (HU) at 12:00 UTC 18 October. We planned to
collect the samples from 18:00 UTC 15 October to 15:00 UTC 18 October during
the data preparation. However, only four qualified samples were retained
(see Fig. 4). The corresponding best track records of Fig. 4a-d are 00:00 UTC 17
October, 12:00 UTC 17 October, 00:00 UTC 18 October, and 12:00 UTC 18 October,
respectively.

Table 7 lists the prediction results of these four samples. It shows that Fig. 4b
is incorrectly classified as non-TC, and the remaining three samples are
correctly identified. Figure 4a is the first sample captured by SSM/I during Wilma’s
formation time, and its correct classification indicates that the proposed model can
detect tropical cyclogenesis at an early stage. Figure 4c, d are the observations during
Table 6 The best track information during the formation time of Wilma (2005)

Time (UTC)  | Status | Longitude | Latitude | Wind speed (kt)
10-15 18:00 |        | 78.50° W  | 17.60° N |
10-15 21:00 |        | 78.66° W  | 17.61° N |
10-16 00:00 |        | 78.80° W  | 17.60° N |
10-16 03:00 |        | 78.91° W  | 17.55° N |
10-16 06:00 |        | 79.00° W  | 17.50° N |
10-16 09:00 |        | 79.10° W  | 17.49° N |
10-16 12:00 |        | 79.20° W  | 17.50° N |
10-16 15:00 |        | 79.30° W  | 17.51° N |
10-16 18:00 |        | 79.40° W  | 17.50° N |
10-16 21:00 |        | 79.52° W  | 17.49° N |
10-17 00:00 |        | 79.60° W  | 17.40° N |
10-17 03:00 |        | 79.61° W  | 17.18° N |
10-17 06:00 | TS     | 79.60° W  | 16.90° N | 35
10-17 09:00 | TS     | 79.64° W  | 16.59° N | 37
10-17 12:00 | TS     | 79.70° W  | 16.30° N | 40
10-17 15:00 | TS     | 79.75° W  | 16.12° N | 42
10-17 18:00 | TS     | 79.80° W  | 16.00° N | 45
10-17 21:00 | TS     | 79.86° W  | 15.89° N | 50
10-18 00:00 | TS     | 79.90° W  | 15.80° N | 55
10-18 03:00 | TS     | 79.88° W  | 15.70° N | 57
10-18 06:00 | TS     | 79.90° W  | 15.70° N | 60
10-18 09:00 | TS     | 80.04° W  | 15.91° N | 62
10-18 12:00 | HU     | 80.30° W  | 16.20° N | 65
10-18 15:00 | HU     | 80.68° W  | 16.44° N | 70


* TD: Tropical Depression, TS: Tropical Storm, HU: Hurricane


Wilma’s mature period, when its TC structure is relatively stable and complete, so
correct results are to be expected. However, the misclassification of Fig. 4b
shows that the quality of samples brings uncertainty and error to the model prediction.
Specifically, compared with the other three samples, Fig. 4b loses nearly half of the
TC system information, which reduces the number of effective pixels and damages
the spiral structure and TB distribution characteristics of the TC. This could be the
main reason for the poor prediction.

However, not all samples with missing information are incorrectly identified;
it is the loss of core TC system information that leads to misclassification. To verify this
conclusion, we select Tropical Storm Hilda (2009) as the test case. Figure 5 shows
Hilda’s four qualified TC samples, and Table 8 lists the corresponding prediction
results. It shows that all the samples are correctly classified although part of the
information is lost in Fig. 5. In particular, the size of the missing part in Fig. 5a, b is similar to
Fig. 4 The brightness temperature images during the formation of Wilma (2005): a 2005/10/17 01:22, b 2005/10/17 13:15, c 2005/10/18 01:07, d 2005/10/18 12:59 (color scale: TB in K, 140-280)

Table 7 Prediction results for the four samples of Wilma (2005)

Time of samples (UTC) | Label | Prediction label | Result | Operating time (s)
10-17 01:22           | 1     | 1                | True   | <0.01
10-17 13:15           | 1     | 0                | False  | <0.01
10-18 01:07           | 1     | 1                | True   | <0.01
10-18 12:59           | 1     | 1                | True   | <0.01


* Label 1: TC, Label 0: non-TC


that in Fig. 4b, but less TC system information falls in the missing part. The TC
structure and TB distribution pattern are not significantly contaminated. Therefore,
these two samples can still be correctly identified. All in all, the proposed model can
detect tropical cyclogenesis well when given high-quality data.
Fig. 5 The brightness temperature images during the formation of Hilda (2009): a 2009/08/22 14:37, b 2009/08/23 02:10, c 2009/08/23 14:02, d 2009/08/25 03:20 (color scale: TB in K)


Table 8 Prediction results for the four samples of Hilda (2009) 


Time of samples (UTC) | Label | Prediction label | Result | Operating time (s)
08-22 14:37           | 1     | 1                | True   | <0.01
08-23 02:10           | 1     | 1                | True   | <0.01
08-23 14:02           | 1     | 1                | True   | <0.01
08-25 03:20           | 1     | 1                | True   | <0.01


* Label 1: TC, Label 0: non-TC
5 Conclusion


In this study, a tropical cyclogenesis detection model is proposed using BLS. In
contrast to deep network methods, the new model is a lightweight flat network,
leading to lower computation cost and shorter training time. Meanwhile, the incremental
learning capacity of BLS is well suited to continuously updated and
accumulated remote sensing data: adding new input data does not require retraining
on the whole updated dataset. Based on the dataset consisting of 3386 TB images, the
testing accuracy, HR, and FAR of BLS are 86.83%, 81.14%, and 11.18%, respectively.
This study confirms the applicability of BLS to binary classification problems in
ocean remote sensing. It also demonstrates the feasibility of detecting TC formation
from satellite microwave TB data.

Although the BLS has shown great power in the classification problem, two defects
need to be addressed. (1) The BLS is insensitive to the features of images, which
leads to poor accuracy when the image features are complicated. Inspired by the
powerful ability of the convolutional neural network (CNN) to capture and learn image
features, we will add a feature extraction module before our BLS to improve its image
processing ability. (2) The dataset is too small to fully support the learning
requirements. There are three ways to expand the dataset: the first is to add
the TB data from other channels (e.g., 19H/V), the second is to obtain samples over a
longer period, and the third is to utilize the TB observations from other microwave
radiometers (e.g., the Microwave Imager onboard the FengYun series satellites).


References 


1. Aparna SG, D’Souza S, Arjun NB (2018) Prediction of daily sea surface temperature using
artificial neural networks. Int J Remote Sens 39(11-12):4214-4231

2. Bergstra J, Yamins D, Cox D (2013) Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures. In: International Conference
on Machine Learning, PMLR, pp 115-123

3. Chen CLP, Liu Z (2018) Broad learning system: An effective and efficient incremental learning
system without the need for deep architecture. IEEE Trans Neural Netw Learn Syst 29(1):10-24.
https://doi.org/10.1109/TNNLS.2017.2716952

4. Chen CLP, Zhang CY (2014) Data-intensive applications, challenges, techniques and
technologies: A survey on Big Data. Inf Sci 275:314-347. https://doi.org/10.1016/j.ins.2014.01.015

5. Chi J, Kim Hc (2017) Prediction of arctic sea ice concentration using a fully data driven deep
neural network. Remote Sens 9(12). https://doi.org/10.3390/rs9121305

6. Cossuth JH, Knabb RD, Brown DP, Hart RE (2013) Tropical cyclone formation guidance
using Pregenesis Dvorak Climatology. Part I: Operational forecasting and predictive potential.
Weather Forecast 28(1):100-118. https://doi.org/10.1175/WAF-D-12-00073.1

7. Dawood M, Asif A, Minhas FuAA (2020) Deep-PHURIE: deep learning based hurricane
intensity estimation from infrared satellite imagery. Neural Comput Appl 32(13):9009-9017.
https://doi.org/10.1007/s00521-019-04410-7


8. Dvorak VF (1975) Tropical cyclone intensity analysis and forecasting from satellite imagery.
Mon Weather Rev 103(5):420-430

9. Gao Y, Gao F, Dong J, Wang S (2019) Transferred deep learning for sea ice change detection
from synthetic aperture radar images. IEEE Geosci Remote Sens Lett 16(10):1655-1659.
https://doi.org/10.1109/LGRS.2019.2906279

10. Ham YG, Kim JH, Luo JJ (2019) Deep learning for multi-year ENSO forecasts. Nature
573(7775):568+. https://doi.org/10.1038/s41586-019-1559-7

11. Han Y, Gao Y, Zhang Y, Wang J, Yang S (2019) Hyperspectral sea ice image classification
based on the spectral-spatial-joint feature with deep learning. Remote Sens 11(18).
https://doi.org/10.3390/rs11182170

12. Hennon CC, Helms CN, Knapp KR, Bowen AR (2011) An objective algorithm for detecting
and tracking tropical cloud clusters: Implications for tropical cyclogenesis prediction. J Atmos
Oceanic Tech 28(8):1007-1018. https://doi.org/10.1175/2010JTECHA1522.1

13. Kim M, Park MS, Im J, Park S, Lee MI (2019) Machine learning approaches for detecting
tropical cyclone formation using satellite data. Remote Sens 11(10).
https://doi.org/10.3390/rs11101195

14. Knaff JA, Brown DP, Courtney J, Gallina GM, Beven JL II (2010) An evaluation of Dvorak
technique-based tropical cyclone intensity estimates. Weather Forecast 25(5):1362-1379.
https://doi.org/10.1175/2010WAF2222375.1

15. Knapp KR, Kruk MC, Levinson DH, Diamond HJ, Neumann CJ (2010) The international
best track archive for climate stewardship (IBTrACS) unifying tropical cyclone data. Bull Am
Meteorol Soc 91(3):363+. https://doi.org/10.1175/2009BAMS2755.1

16. Kuok SC, Yuen KV (2020) Broad learning for nonparametric spatial modeling with application
to seismic attenuation. Comput-Aided Civil Infrastruct Eng 35(3):203-218

17. Kuok SC, Yuen KV (2020) Multi-resolution broad learning for model updating using
incomplete modal data. Struct Control Health Monit 27(8):e2571

18. Li X, Liu B, Zheng G, Ren Y, Zhang S, Liu Y, Gao L, Liu Y, Zhang B, Wang F (2020)
Deep-learning-based information mining from ocean remote-sensing imagery. Natl Sci Rev
7(10):1584-1605. https://doi.org/10.1093/nsr/nwaa047

19. Matsuoka D, Nakano M, Sugiyama D, Uchida S (2018) Deep learning approach for detecting
tropical cyclones and their precursors in the simulation by a cloud-resolving global
nonhydrostatic atmospheric model. Progress Earth Planet Sci 5.
https://doi.org/10.1186/s40645-018-0245-

20. Nakano M, Kubota H, Miyakawa T, Nasuno T, Satoh M (2017) Genesis of super cyclone Pam
(2015): Modulation of low-frequency large-scale circulations and the Madden-Julian oscillation
by sea surface temperature anomalies. Mon Weather Rev 145(8):3143-3159.
https://doi.org/10.1175/MWR-D-16-0208.1

21. Olander TL, Velden CS (2007) The advanced Dvorak technique: Continued development of an
objective scheme to estimate tropical cyclone intensity using geostationary infrared satellite
imagery. Weather Forecast 22(2):287-298. https://doi.org/10.1175/WAF975.1

22. Olander TL, Velden CS (2019) The advanced Dvorak technique (ADT) for estimating tropical
cyclone intensity: Update and new capabilities. Weather Forecast 34(4):905-922.
https://doi.org/10.1175/WAF-D-19-0007.1

23. Park MS, Kim M, Lee MI, Im J, Park S (2016) Detection of tropical cyclone genesis via
quantitative satellite ocean surface wind pattern and intensity analyses using decision trees.
Remote Sens Environ 183:205-214. https://doi.org/10.1016/j.rse.2016.06.006

24. Pradhan R, Aygun RS, Maskey M, Ramachandran R, Cecil DJ (2018) Tropical cyclone intensity
estimation using a deep convolutional neural network. IEEE Trans Image Process 27(2):692-702.
https://doi.org/10.1109/TIP.2017.2766358

25. Rozoff CM, Velden CS, Kaplan J, Kossin JP, Wimmers AJ (2015) Improvements in the
probabilistic prediction of tropical cyclone rapid intensification with passive microwave observations.
Weather Forecast 30(4):1016-1038. https://doi.org/10.1175/WAF-D-14-00109.1

26. Scher S, Messori G (2019) Weather and climate forecasting with neural networks: using general
circulation models (GCMs) with different complexity as a study ground. Geosci Model Dev
12(7):2797-2809. https://doi.org/10.5194/gmd-12-2797-2019




27. Su H, Wu L, Jiang JH, Pai R, Liu A, Zhai AJ, Tavallali P, DeMaria M (2020) Applying
satellite observations of tropical cyclone internal structures to rapid intensification forecast
with machine learning. Geophys Res Lett 47(17). https://doi.org/10.1029/2020GL089102

28. Velden C, Olander T, Zehr R (1998) Development of an objective scheme to estimate
tropical cyclone intensity from digital geostationary satellite infrared imagery. Weather Forecast
13(1):172-186. https://doi.org/10.1175/1520-0434(1998)013<0172:DOAOST>2.0.CO;2

29. Velden C, Harper B, Wells F, Beven JL II, Zehr R, Olander T, Mayfield M, Guard CC, Lander M,
Edson R, Avila L, Burton A, Turk M, Caroff A, Christian A, Caroff P, McCrone P (2006) The
Dvorak tropical cyclone intensity estimation technique. Bull Am Meteorol Soc 87(9):1195-1210.
https://doi.org/10.1175/BAMS-87-9-1195

30. Wei L, Guan L, Qu L, Guo D (2020) Prediction of sea surface temperature in the China Seas
based on long short-term memory neural networks. Remote Sens 12(17).
https://doi.org/10.3390/rs12172697

31. Wentz F (1997) A well-calibrated ocean algorithm for special sensor Microwave/Imager. J
Geophys Res-Oceans 102(C4):8703-8718. https://doi.org/10.1029/96JC01751

32. Xiang B, Lin SJ, Zhao M, Zhang S, Vecchi G, Li T, Jiang X, Harris L, Chen JH (2015)
Beyond weather time-scale prediction for Hurricane Sandy and super typhoon Haiyan in a
global climate model. Mon Weather Rev 143(2):524-535.
https://doi.org/10.1175/MWR-D-14-00227.1

33. Yamaguchi M, Koide N (2017) Tropical cyclone genesis guidance using the early stage Dvorak
analysis and global ensembles. Weather Forecast 32(6):2133-2141.
https://doi.org/10.1175/WAF-D-17-0056.1

34. Yi K, Wang X, Cheng Y, Chen C (2018) Hyperspectral imagery classification based on
semi-supervised broad learning system. Remote Sens 10(5):685

35. Zhang W, Fu B, Peng MS, Li T (2015) Discriminating developing versus nondeveloping tropical
disturbances in the Western North Pacific through Decision Tree Analysis. Weather Forecast
30(2):446-454. https://doi.org/10.1175/WAF-D-14-00023.1

36. Zhang Y, Yuen KV (2021) Crack detection using fusion features-based broad learning
system and image processing. Comput-Aided Civil Infrastruct Eng.
https://doi.org/10.1111/mice.12753

37. Zheng G, Li X, Zhang RH, Liu B (2020) Purely satellite data-driven deep learning forecast of
complicated tropical instability waves. Sci Adv 6(29). https://doi.org/10.1126/sciadv.aba1482


Open Access This chapter is licensed under the terms of the Creative Commons Attribution- 
NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/ 
by-nc-nd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in 
any medium or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if you modified the licensed 
material. You do not have permission under this license to share adapted material derived from this 
chapter or parts of it. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Tropical Cyclone Monitoring Based on
Geostationary Satellite Imagery


Chong Wang, Qing Xu, Xiaofeng Li, Gang Zheng, and Bin Liu 


1 Introduction 


Tropical cyclones (TCs) are extreme weather systems that develop over the tropical
ocean. They are called typhoons in the northwestern Pacific and hurricanes in the
eastern Pacific and Atlantic. The Northwest Pacific Ocean is the basin with the largest
number of TCs in the world [5, 11], where TCs can be observed throughout the year
[48]. TCs can cause huge losses to marine production and transportation: they sweep
away fishing nets, destroy fish rafts and breeding cages, kill large numbers of fish
and shellfish, destroy breakwaters and offshore oil platforms, and overturn ships and
aircraft. After making landfall, they usually cause hazardous disasters such as storm
surges, mountain torrents, urban waterlogging, landslides and debris flows owing to
the destructive wind and low pressure, which poses a serious threat to coastal areas
and causes great damage to human life and property.

The annual number of TCs generated in the Northwest Pacific Ocean accounts
for 36% of the total number of TCs in the world [55]. In this basin, TCs occur most
frequently from July to September. They can easily develop into super TCs under
conditions of high temperature and high humidity. Under the influence of the marine
and atmospheric environment, TCs usually move from the southeast to the northwest.
Many countries have listed TCs as one of the major natural disasters affecting
national public security, and have explicitly proposed strengthening research on key
technologies for TC monitoring and early warning [20, 54].

Locating the TC center and estimating its intensity are widely considered
important parts of TC monitoring by meteorological forecasting agencies. Against
the background of global warming, the intensity of global TCs shows a significant
increasing trend [20]. Accurate TC center position and intensity information can
better initialize numerical models and improve their predictions of the
intensity and movement direction of TCs, especially rapidly intensifying TCs
[20, 50], and thus help people take precautions in advance and reduce losses.

Traditional TC monitoring platforms include coastal stations, buoys, oil drilling
platforms, etc. However, these platforms record very few TC data because they are
sparsely distributed. With the development of airborne remote sensing technology,
real-time observation of TCs over large spatial and temporal ranges has been realized
using airborne radars or radiometers, but it is still limited by extreme weather conditions.
Since 1970, a major breakthrough has been made in satellite remote sensing. Many
countries have launched geostationary meteorological satellites in succession, which
provide real-time and continuous observation of the earth. Spaceborne sensors
generally have high spatial resolution and have become an important tool for TC
monitoring. Although more and more observation methods are available, the monitoring
of TCs still relies on the experience of experts to a certain extent. In particular, there
is a lack of objective and effective monitoring methods for weak TCs at the stage of
formation or dissipation.

A variety of TC center location methods based on infrared images, scatterometer
data or synthetic aperture radar images have been developed. These methods can be
divided into four categories. The first category is the subjective method, which locates
the TC center based on the forecaster’s experience-based judgment of the TC’s Central
Dense Overcast in the satellite images [7-10, 37]. The second is the threshold method,
which segments and identifies the TC’s eye area in satellite images and
determines the morphological center of the eye area [2, 14, 28, 30, 32, 47]. The third
is the spiral curve method. The structure of the TC cloud system
is not symmetrical, but spiral. The spiral can be expressed by vector
distance, and the spiral center is taken as the TC center [23, 51]. The fourth category is
the wind vector or cloud motion wind (CMW) method, which uses the wind field retrieved
by scatterometers or time series of infrared satellite images to establish the relationship
between the wind vector or CMW and the movement of a TC so as to locate
the TC center [17, 21, 31, 33, 34, 40, 52, 57].

For TC intensity estimation using satellite images, the two traditional methods are
the Dvorak technique [7, 10, 12, 19, 56] and the empirical regression method [3, 16, 24,
26, 36, 38, 42-44]. The Dvorak technique was first proposed by Dvorak in 1975 [7]
and has been in operational use for more than 40 years. The method assumes that the
rotation and shape of a TC eye area are related to the strengthening and weakening of a TC,
and that TCs with similar intensities have similar morphological characteristics. In the
empirical regression method, such as the deviation-angle variance technique (DAVT)
[3], the TC intensity is determined by establishing the relationship between the TC
intensity and different characteristics. Both methods rely on high-level hand-crafted
features derived from satellite images of TCs. However, it is difficult to obtain
general characteristics of TCs and establish empirical regression models for TCs at
different development stages and in different regions.

In the past few years, meteorologists and oceanographers have introduced the
convolutional neural network (CNN) into TC monitoring. A CNN is a kind of feedforward
deep neural network that is trained with the backpropagation algorithm, and it is
inspired by the natural visual cognitive mechanism of organisms [22].
A complete CNN model consists of convolutional layers, pooling layers and fully
connected layers. The convolutional layer extracts features from the image,
the pooling layer filters the features and reduces their number, and the fully connected
layer learns the relationship between these features and the model output. The CNN
model avoids complex image preprocessing and does not depend
on prior knowledge of TCs, and thus can meet the requirements of automatic
and objective TC center location and intensity estimation.

The CNN has achieved great success in image classification and target recognition
[15, 22, 24, 26, 38, 42, 43, 46] and also shows great application potential in TC
intensity estimation. Pradhan et al. [36] collected 8138 TC images over the North
Pacific Ocean and the Atlantic Ocean from 1998 to 2012, and labelled the images
with eight categories based on the Saffir-Simpson hurricane scale and the HURDAT2
Best Track dataset. They used a CNN model to classify the TCs and then calculated
the maximum wind speed (MWS) from the probability of each
TC category. The root mean square error (RMSE) of the derived MWS is 10.19 kt.
However, the satellite images used in their study contain green shorelines, which
may affect the accuracy of TC intensity estimation [16]. Combinido et al. [6] used a
CNN regression model to estimate TC intensity, but the RMSE of the MWS (13.23
kt) was larger than that of the CNN classification model proposed by [36]. Also
with a CNN regression model, Chen et al. calculated the intensity of TCs in the
global ocean based on more TC images [36, 44] and the RMSE was reduced to 10.58
kt. Recently, Tian et al. combined the CNN classification and regression models to
estimate TC intensity [3], and obtained a smaller RMSE of 8.91 kt.

The above results show that both the CNN classification model and the regression model
perform well in TC intensity estimation. Compared with the Dvorak technique, the
CNN model does not rely on the subjective judgment of the forecaster, which ensures
the objectivity of the method. Compared with the empirical regression method, the
neural network demands less expert knowledge from the user.
In addition, although the TC center position is necessary in the CNN
model, the information is only used to crop the input images. Therefore, compared
with the empirical regression method, the CNN model does not require a very high
TC positioning accuracy.

In conclusion, the accuracy of TC center location and intensity estimation techniques
has been greatly improved in recent years, but there are still some limitations.
Operational TC monitoring needs a fast, robust and objective method,
and the CNN meets this requirement. In this chapter, a set of CNN models is
designed to determine the TC center position and intensity from Himawari-8
geostationary satellite images in the Northwest Pacific Ocean. By examining the influence
of the sensor channel, the number of satellite images and the configuration of the
neural network on the model performance, an optimal CNN model is developed
for automatic TC monitoring. The data and structure of the CNN model are described
in Sect. 2. Sections 3 and 4 present the TC center location and intensity estimation
results, respectively. Section 5 is the summary.


2 Data and Methodology 


2.1 Data 


The Japan Meteorological Agency (JMA) launched the Himawari-8 (H-8) geostationary
meteorological satellite in October 2014. The Advanced Himawari Imager (AHI)
onboard H-8 observes different regions in different modes: Full Disk
(global scope), Japan Area (scope of two Japanese regions), Specific Area (scope of
two regions), and Landmark Area (scope of two regions). The Full Disk and Japan
Area are fixed, while the Specific Area and Landmark Area can be adjusted
flexibly. The scanning range is shown in Fig. 1 (60°S-60°N, 80°E-160°W). The
on-orbit working life of H-8 is 8 years [4]. After calibration and calibration testing, the
data have been available since July 2015 (http://www.eorc.jaxa.jp/ptree).

AHI has 16 channels covering the whole Northwest Pacific: three visible, three
near-infrared, and ten thermal-infrared channels. Full-disk observations are taken every
10 minutes. We obtained brightness temperature data from five infrared (IR) channels
(Channels 7, 8, 13, 14, and 15; see Table 1 for details, from http://www.eorc.jaxa.jp/ptree).

Based on AHI data, we collected 6,690 satellite images of 97 TCs over the Northwest
Pacific from 2015 to 2018, at a time interval of 3 hours throughout the life cycle of
each TC; the image resolution is 5 km. As shown in Fig. 2, the brightness
temperature data of 5 channels (7, 8, 13, 14, 15) with higher transmittance near
the large window (center wavelengths of 3.9, 6.2, 10.4, 11.2 and 12.4 µm, respectively)
were selected to locate the TC center and estimate TC intensity. Channel 7 is mainly used
for observing low-level clouds and natural disasters, channel 8 for observing
water vapor content in the upper and middle layers, channel 13 for observing
cloud images and cloud top conditions, and channels 14 and 15 for
observing cloud images and sea surface temperature [4].

Fig. 1 Himawari-8 satellite scanning range

We use the TC Best Track dataset provided by the tropical cyclone information
center of the China Meteorological Administration (CMA) (http://tcdata.TC.org.cn) as
ground truth data. The dataset was compiled by the Shanghai Typhoon Institute. The
TC Best Track dataset over the Northwest Pacific from 1949 to 2019 includes the TC
number and name, time, longitude and latitude of the TC center, the maximum sustained
wind speed (MSW, 2-minute average), the minimum air pressure, etc. The time interval
is 6 hours before 2017 and has been shortened to 3 hours for landfalling TCs since 2017.
Starting from 2018, the Best Track data has provided 3-h TC information for the 24 hours
before landfall [1, 45]. Consistent with the H-8 satellite observations, we downloaded
the Best Track data of 97 TCs over the Northwest Pacific during 2015-2018.




Table 1 Himawari-8 channel setup [4]

Channel  Wavelength (µm)  Observation object
1        0.46             Vegetation, aerosol observation and color image synthesis
2        0.51             Vegetation, aerosol observation and color image synthesis
3        0.64             Sublayer cloud and color image synthesis
4        0.86             Vegetation and aerosol observation
5        1.6              Cloud phases
6        2.3              The effective radius of cloud droplets
7        3.9              Sublayer cloud and natural disasters
8        6.2              Water vapor in the upper and middle layers
9        7.0              Water vapor in the middle layers
10       7.3              Water vapor in the middle layers
11       8.6              Cloud phases and SO2
12       9.6              O3
13       10.4             Cloud image and cloud top image
14       11.2             Cloud image and sea surface temperature
15       12.3             Cloud image and sea surface temperature
16       13.3             Cloud height


Fig. 2 Brightness temperature (unit: K) images of TC “Soudelor” in different channels, obtained by
AHI on the Himawari-8 satellite at 18:00 (UTC) on August 15, 2015: a channel 7, b channel 8,
c channel 13, d channel 14, e channel 15. The image covers an area of 1255 km x 1255 km


2.2 Data Pre-Processing


We selected H-8 satellite images synchronized with the 6-h Best Track dataset to
design the CNN-based TC center location (CNN-L) model. There are 3298 images
in total, of which 1971 are used for model training, 657 for validation and 670 for
testing. For each training or validation image, we extracted a sub-image covering
an area of 1500 km x 1500 km (301 x 301 pixels), and then cropped it from
301 x 301 to 151 x 151 pixels three times, each time shifting the crop window up, down,
left or right randomly. We then labeled each cropped image with the number and
direction of the shifted pixels. For example, if a TC center is shifted left by 5 pixels
and up by 10 pixels, the image is labeled (-5, 10). The test images were only cropped,
without shifting. In the end, we obtained
5913 training images, 1971 validation images, and 670 test images with a reduced
size of 151 x 151.
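
The cropping and labeling step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the uniform random sampling and the default maximum shift of 75 pixels (the largest offset that keeps a 151 x 151 window inside a 301 x 301 image) are assumptions, since the text only states that each image is shifted randomly three times and labeled with the shift.

```python
import numpy as np

def crop_with_shift(image, dx, dy, out_size=151):
    """Crop an out_size x out_size window so that the TC center, assumed to sit
    at the center of `image`, appears shifted by (dx, dy) pixels in the crop.
    dx < 0 means the center moves left; dy > 0 means it moves up."""
    half = out_size // 2
    center_row, center_col = image.shape[0] // 2, image.shape[1] // 2
    # Moving the TC left/up inside the crop means moving the window right/down.
    row0 = center_row + dy - half
    col0 = center_col - dx - half
    if (row0 < 0 or col0 < 0 or row0 + out_size > image.shape[0]
            or col0 + out_size > image.shape[1]):
        raise ValueError("shift moves the crop window outside the image")
    return image[row0:row0 + out_size, col0:col0 + out_size]

def random_shift_crops(image, rng, n_crops=3, max_shift=75, out_size=151):
    """Return n_crops (crop, (dx, dy)) pairs with random shifts (assumed uniform)."""
    samples = []
    for _ in range(n_crops):
        dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
        samples.append((crop_with_shift(image, dx, dy, out_size), (dx, dy)))
    return samples

def center_crop(image, out_size=151):
    """Test images are only cropped, so the TC center stays in the middle."""
    return crop_with_shift(image, 0, 0, out_size)

# Example with a synthetic 301 x 301 image whose "TC center" is a single bright pixel.
rng = np.random.default_rng(0)
image = np.zeros((301, 301))
image[150, 150] = 1.0
for crop, label in random_shift_crops(image, rng):
    print(crop.shape, label)
```

Applying three random crops to each of the 1971 training and 657 validation images gives the 5913 and 1971 samples quoted above.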

More training images may also help to improve the performance of the CNN-based TC
intensity estimation (CNN-I) model. Hence, before estimating TC intensity,
we interpolated the MSW provided by the Best Track dataset to 3-hour intervals to
increase the number of samples. The resulting 6690 3-h images, each with a size of
251 x 251, were labeled with eight categories: the Saffir-Simpson hurricane wind
scale categories (H1 to H5), tropical storm (TS), tropical depression (TD)
and no category (NC) (Table 2). The 6690 images
are divided into training, validation and testing sets with 4014, 1338 and 1338
images, respectively. Studies show that rotating images in a CNN model can
reduce the sensitivity to orientation and does not affect the classification accuracy.
Therefore, the normalized training and validation images were also rotated by
90°, 180° and 270° clockwise. This adds three rotated copies of each image, and
we finally obtained 16,056 training images and 5352 validation
images for the construction of the CNN-based TC intensity estimation model.
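
The labeling and rotation steps can be illustrated with the short sketch below. The function names are hypothetical, and two details are assumptions: wind speeds are mapped to categories with the lower bounds of Table 2 (treating 137 kt and above as H5), and each original image is kept alongside its three rotated copies.

```python
from bisect import bisect_right

import numpy as np

# Lower bounds (kt) of TD, TS, H1, ..., H5 taken from Table 2; below 20 kt is NC.
LOWER_BOUNDS_KT = [20, 34, 64, 83, 96, 113, 137]
CATEGORIES = ["NC", "TD", "TS", "H1", "H2", "H3", "H4", "H5"]

def category_from_msw(msw_kt):
    """Map a maximum sustained wind speed in knots to its Table 2 symbol."""
    return CATEGORIES[bisect_right(LOWER_BOUNDS_KT, msw_kt)]

def augment_with_rotations(images, labels):
    """Return each image plus copies rotated 90, 180 and 270 degrees clockwise."""
    aug_images, aug_labels = [], []
    for image, label in zip(images, labels):
        for k in range(4):  # k = 0 keeps the original orientation
            # np.rot90 rotates counterclockwise for positive k, so negate it.
            aug_images.append(np.rot90(image, k=-k, axes=(0, 1)))
            aug_labels.append(label)
    return aug_images, aug_labels

# Example: one tiny "image" labeled from a 50 kt wind speed.
image = np.array([[1, 2], [3, 4]])
images, labels = augment_with_rotations([image], [category_from_msw(50)])
print(len(images), labels[0])
```

With four orientations per image, the 4014 training and 1338 validation images become the 16,056 and 5352 samples quoted above.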


Table 2 Saffir-Simpson hurricane wind scale and related classifications (The Saffir-Simpson Team,
2012). MSW is ten-minute averaged maximum sustained wind speed

Symbol  Category             MSW (kt)
NC      No Category          < 20
TD      Tropical depression  20-33
TS      Tropical storm       34-63
H1      Category 1           64-82
H2      Category 2           83-95
H3      Category 3           96-112
H4      Category 4           113-136
H5      Category 5           ≥ 137


2.3 Methodology


The convolutional neural network (CNN) model consists of convolutional layers (C),
pooling layers (P) and fully connected layers (FC). The convolutional layer is used to
extract image features. To reduce the number of features and speed up the
calculation, the pooling layer filters these features using the maximum, average or
minimum value. The nonlinear relationship between the
filtered features and the output is learned by the fully connected layer. To prevent
overfitting, a dropout layer is usually added before the fully connected layer [26],
which disables part of the connections between adjacent layers. A CNN is
usually designed as a feedforward network that can be trained with the backpropagation
algorithm. When the error propagates back through the model, the network is optimized
by updating the weights and biases to minimize the value of the loss function
[22].
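
To make the pooling step concrete, the sketch below applies maximum, average and minimum pooling to a small feature map with NumPy. It is an illustrative, unpadded implementation with a hypothetical function name, not the pooling code used in this chapter.

```python
import numpy as np

def pool2d(feature, size, stride, reduce=np.max):
    """Slide a size x size window over a 2-D feature map with the given stride
    and keep one value per window (no padding)."""
    height, width = feature.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    pooled = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            window = feature[i * stride:i * stride + size,
                             j * stride:j * stride + size]
            pooled[i, j] = reduce(window)
    return pooled

feature = np.arange(16).reshape(4, 4)
print(pool2d(feature, size=2, stride=2, reduce=np.max))   # maximum pooling
print(pool2d(feature, size=2, stride=2, reduce=np.mean))  # average pooling
print(pool2d(feature, size=2, stride=2, reduce=np.min))   # minimum pooling
```

Each 4 x 4 feature map is reduced to 2 x 2, which is how pooling cuts the number of features passed to later layers.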

The CNN model architecture for TC monitoring is shown in Fig. 3. It consists of
one input layer, four convolutional layers, four pooling layers, one dropout layer, two
fully connected layers, and one output layer. Table 3 lists the parameter setting of
the CNN model shown in Fig. 3. The parameters include the kernel shape used in each
convolutional layer, the filter shape used in each pooling layer, the step size, the
output shape and the number of parameters.

Fig. 3 Framework of the CNN based TC center location model

Table 3 The MLE of CNN based TC center location model

Model number  Wavelength (µm)  MLE (km)
CNN-L 1       10.4             40.3
CNN-L 2       11.2             40.5
CNN-L 3       12.3             40.1

Taking the CNN model shown in Fig. 3 as an example, a satellite
image with a size of 151 x 151 is input into the CNN model. The first convolution
operation is performed on the satellite image with 32 kernels of size 10 x 10 and a
step size of 2. In this process, the model needs to learn 3232
parameters. In the first pooling layer, the 32 feature images are filtered with a 3 x 3
window and a step size of 2 using the maximum pooling method, and 32 feature images
with the size of 38 x 38 are obtained. After four convolutional layers and four
pooling layers, 256 abstract feature images of size 3 x 3 are finally generated.
These feature images are flattened into 1-dimensional data of size 2304, passed
through the dropout layer (the dropout rate is 0.5), and then connected to the first fully
connected layer, where the model needs to learn 2360320 parameters. Finally, the model
output is obtained in the output layer. In the whole training process, the model needs
to learn a total of 2915298 parameters.
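
The parameter counts quoted for this architecture can be checked with a few lines of arithmetic. The sketch below assumes a single-channel input, the kernel configuration of CNN-L 3 in Table 4, two fully connected layers of 1024 and 128 neurons, a two-neuron output layer (horizontal and vertical shift), and one bias per kernel or neuron; the helper names are hypothetical. Under these assumptions the counts agree with the figures in the text.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_kernels):
    """Weights of all kernels plus one bias per kernel."""
    return kernel_h * kernel_w * in_channels * n_kernels + n_kernels

def dense_params(n_inputs, n_outputs):
    """Weight matrix plus one bias per output neuron."""
    return n_inputs * n_outputs + n_outputs

layers = [
    ("C1", conv_params(10, 10, 1, 32)),       # 32 kernels of 10 x 10 on one channel
    ("C2", conv_params(5, 5, 32, 64)),
    ("C3", conv_params(3, 3, 64, 128)),
    ("C4", conv_params(3, 3, 128, 256)),
    ("FC1", dense_params(3 * 3 * 256, 1024)),  # 256 feature images of 3 x 3 = 2304 inputs
    ("FC2", dense_params(1024, 128)),
    ("output", dense_params(128, 2)),          # assumed (x, y) shift output
]

for name, count in layers:
    print(f"{name}: {count}")
total = sum(count for _, count in layers)
print("total:", total)  # 2915298
```

Pooling and dropout layers have no trainable parameters, so they do not appear in the sum.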

For TC center location and intensity estimation from multi-channel satellite
images, the input channel, the image resolution and the model parameters affect
the training efficiency and accuracy of the CNN model. By setting up a group of
sensitivity experiments to investigate the influence of different factors on the model
performance, we aim to develop an optimal CNN model for TC monitoring.


3 TC Center Location 


The CNN model for TC center location, i.e., the CNN-L model shown in Fig. 3,
is established in this section. We use the mean location error (MLE) to evaluate
the performance of the model:


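$$\mathrm{MLE} = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(x_i - x'_i)^2 + (y_i - y'_i)^2} \qquad (1)$$

where $n$ is the number of test images; $(x_i, y_i)$ and $(x'_i, y'_i)$ are the TC center positions
of the $i$-th test image from the Best Track data and the CNN-L model output, respectively.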
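
Equation (1) can be evaluated with the short sketch below. It assumes, for illustration only, that positions are given in pixel units and converted to kilometers with the 5 km image resolution stated in Sect. 2.1; the function name is hypothetical.

```python
import numpy as np

PIXEL_SIZE_KM = 5.0  # image resolution from Sect. 2.1 (assumed conversion factor)

def mean_location_error(true_xy, pred_xy, pixel_size_km=PIXEL_SIZE_KM):
    """Mean Euclidean distance between true and predicted centers, in km.
    Both inputs have shape (n, 2) and hold (x, y) positions in pixels."""
    true_xy = np.asarray(true_xy, dtype=float)
    pred_xy = np.asarray(pred_xy, dtype=float)
    distances = np.linalg.norm(true_xy - pred_xy, axis=1)
    return float(distances.mean() * pixel_size_km)

# Two test images with errors of 5 and 10 pixels.
truth = [(0, 0), (0, 0)]
prediction = [(3, 4), (6, 8)]
print(mean_location_error(truth, prediction))  # 37.5
```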

Fig. 4 Himawari-8 Channel 7 images of TCs: a Halola (9:00 UTC on 13 July 2015), b Soudelor
(18:00 UTC on 30 July 2015), c Atsani (12:00 UTC on 16 August 2015), d Meranti (12:00 UTC
on 14 September 2016)


As shown in Fig. 4, images from channel 7 or 8 sometimes contain noise
due to sensor calibration problems, which would seriously degrade the TC location
accuracy. Therefore, we only use the images of channels 13, 14 and 15 as the input
of the CNN-L model.

We take TC “Maria”, which occurred from 4 to 11 July 2018, as an example.
Figure 5 shows the TC center location results using the network configurations listed
in Table 3. The mean location errors between the outputs of the different CNN-L models
and the observations are similar, suggesting that the image channel has little effect on
the accuracy of the TC center location model.

Using Channel 15 data as the input, as in the CNN-L 3 model, a variety of network
configurations listed in Table 4 were further tested. The CNN-L 3, CNN-L 4, and CNN-L
5 models consist of three or four convolutional layers, three or four pooling layers,
and two FC layers. However, each model has different kernels in the
convolutional layers. The stride and zero-padding, the shape of the FC layer, and the
dropout in each model are listed in Table 4. One can see that CNN-L 3 produces
the lowest MLE, indicating that neither fewer convolutional layers (CNN-L 4) nor
larger kernels (CNN-L 5) improve the CNN-L model.

The results of the CNN-L 3 model for different categories of TCs are shown in Figs. 5
and 6 and Table 5. Strong TCs generally have a more distinct and stable structure
and a more obvious eye area than weak TCs. As a result, the mean location error
of the CNN-L 3 model decreases rapidly as TC intensity increases. The
average MLEs of H1-H5 and H4-H5 TCs are 30 km and less than 25 km, respectively.
As shown in Table 6, the accuracy of our CNN-based TC center location model is
comparable to that of some techniques that also locate TCs from IR images with
spatial resolutions of 2.5-12.5 km.

Fig. 5 CNN based TC center location model results for TC “Maria”, which occurred from 4 to 11
July 2018, at different stages: a Tropical depression, b Tropical storm, c Category 1, d Category
2, e Category 3, f Category 4. The red, yellow and purple dots represent the TC center positions
determined by the CNN-L model with the brightness temperature of Channels 13, 14 and 15 as the
input, respectively. The blue dot shows the TC center position from the Best Track dataset provided
by CMA

Table 4 Results of the CNN based TC center location models with different parameters (for example,
C1 means the first convolutional layer and P1 means the first pooling layer; C1(32@3 x 3)
denotes 32 kernels in the first convolutional layer with a size of 3 x 3)

Model number  Parameters                    MLE (km)
CNN-L 3       C1(32@10 x 10), P1(3 x 3)     40.1
              C2(64@5 x 5), P2(3 x 3)
              C3(128@3 x 3), P3(3 x 3)
              C4(256@3 x 3), P4(2 x 2)
              Dropout = 0.5, FC1024, FC128
CNN-L 4       C1(32@10 x 10), P1(3 x 3)     42.3
              C2(64@5 x 5), P2(3 x 3)
              C3(128@3 x 3), P3(3 x 3)
              Dropout = 0.5, FC1024, FC128
CNN-L 5       C1(32@10 x 10), P1(3 x 3)     41.8
              C2(64@8 x 8), P2(3 x 3)
              C3(128@4 x 4), P3(3 x 3)
              C4(256@4 x 4), P4(2 x 2)
              Dropout = 0.5, FC1024, FC128

Fig. 6 Mean location error (km) of the CNN-L 3 model for different categories of TCs. Numbers
in brackets represent the number of samples in the test group for each category


Table 5 Mean location error (km) of the CNN-L 3 model

TC category  NC  TD   TS   H1  H2  H3  H4  H5
Data number  17  203  298  99  43  33  18  3
MLE (km)     90  53   37   32  25  28  21  15


Table 6 Performance of the CNN model in this study and other methods for TC center location

Literature         MLE (km)
Pal P et al. [32]  11-79
Jin S et al. [23]  42.1
Our model          40.1


4 TC Intensity Estimation 


As in Sect. 3, the influence of the input data and model parameters on
the performance of the CNN-based TC intensity estimation model, i.e., the CNN-I
model shown in Fig. 7, is investigated in this section. A possible solution to the
side effects of the imbalanced dataset is also discussed.

Fig. 7 Framework of the CNN based TC intensity estimation model

Table 7 Results of the CNN based TC intensity estimation model with images of different channels
as the input

Model number  Wavelength (µm)      Accuracy (%)  RMSE (kt)
CNN-I 1       3.9                  78.1          12.14
CNN-I 2       6.2                  78.2          12.41
CNN-I 3       10.4                 80.3          11.40
CNN-I 4       11.2                 80.3          11.54
CNN-I 5       12.3                 80.5          11.28
CNN-I 6       3.9, 12.3            81.0          11.25
CNN-I 7       6.2, 12.3            81.3          11.03
CNN-I 8       10.4, 12.3           81.5          11.05
CNN-I 9       11.2, 12.3           80.7          11.71
CNN-I 10      3.9, 10.4, 12.3      82.7          10.76
CNN-I 11      6.2, 10.4, 12.3      82.5          10.89
CNN-I 12      3.9, 6.2, 10.4, 12.3 82.9          10.64

We evaluate the performance of the CNN-I model from three aspects:
(1) Accuracy
The number of exact hits, which refers to the correct classification of a TC with
the highest confidence, is the accuracy metric.
(2) Root mean square error (RMSE) and mean absolute error (MAE) of TC intensity
For categories TD through H4, we define the estimated intensity or MSW of a
TC as the probability-weighted average of the two categories with the highest
probabilities. Otherwise, we use the mean speed of the category that has the highest
confidence. The MSW (W) is evaluated as [35]:

$$W = U_1 \times P_1 + U_2 \times P_2 \qquad (2)$$

where $P_1$ and $P_2$ are the probabilities output by the model for the TC categories
with the highest and second-highest confidence, respectively, and $U_1$ and $U_2$ are the
mean wind speeds of the corresponding categories.
(3) Confusion matrix and classification report
As shown in Table 9, the confusion matrix depicts a model’s overall classification
performance. The numbers along the diagonal of a confusion matrix represent
the number of correctly identified images for each category. A CNN model’s
precision (P), recall (R, or confidence of detection), and F1-score (F1) are described
in the classification report. P is the ratio of true positives to all positive
classifications, and R is the ratio of true positives to the number of positive samples
in the test data. F1 = 2PR/(P + R) is the harmonic mean of recall and precision.
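
The three metrics above can be sketched in a few NumPy functions. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, the per-category mean wind speeds are supplied by the caller because the chapter does not list them, and the confusion matrix is assumed to have true categories in rows and predicted categories in columns.

```python
import numpy as np

def exact_hit_accuracy(true_idx, probs):
    """Fraction of images whose most probable category equals the true one."""
    predicted = np.argmax(np.asarray(probs), axis=1)
    return float(np.mean(predicted == np.asarray(true_idx)))

def estimate_msw(probs, mean_speed, weighted_idx):
    """Estimate the MSW of one image from its category probabilities.
    mean_speed[k] is the mean wind speed U of category k (caller-supplied);
    weighted_idx holds the category indices for which Eq. (2) applies
    (TD through H4 in the text)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # categories by descending probability
    first, second = int(order[0]), int(order[1])
    if first in weighted_idx:
        # Eq. (2): W = U1 x P1 + U2 x P2
        return float(mean_speed[first] * probs[first]
                     + mean_speed[second] * probs[second])
    return float(mean_speed[first])

def precision_recall_f1(confusion):
    """Per-category P, R and F1 from a confusion matrix
    (rows: true category, columns: predicted category)."""
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)
    actual = confusion.sum(axis=1)
    zeros = np.zeros_like(true_pos)
    precision = np.divide(true_pos, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=zeros.copy(), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    return precision, recall, f1
```

The RMSE and MAE then follow from comparing the estimated W of every test image with the Best Track MSW.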

Table 7 shows the results of the CNN based TC intensity estimation model based 
on 1338 test images with the model parameters shown in Fig. 7. 

As shown in Table 7, for CNN-I models with single-channel input, Channels 13-15
achieve a higher accuracy. Channels 14 and 15 correspond to the thermal-IR
bands, which are often used to observe clouds and calculate sea surface temperature
(SST). From a different perspective, the data support the conclusion of [41] that TC
intensity is connected with the temperature deficit of the cloud top against the sea
surface. We first combined Channel 15 with each of the other channels in models
CNN-I 6 through CNN-I 9 to examine the use of multi-channel combinations in TC
intensity estimation. Comparing CNN-I 6 through CNN-I 9 with CNN-I 5, we can see
that combining Channel 15 with Channel 7, 8 or 13 improves the model performance,
while Channel 14 contributes little to the improvement of the model because its
wavelength is close to that of Channel 15. These two thermal infrared channels may
provide redundant information for TC intensity estimation.

With the input of more (3 or 4) channels of data, the 4-channel model CNN-I 12 produces the best result, with an accuracy of 82.9% and an RMSE of 10.64 kt. This is also consistent with some theoretical studies, which indicate that the information on water vapor, cloud characteristics, and the brightness temperature difference between the cloud top and the sea surface provided by the combination of Channels 7, 8, 13, and 15 plays an important role in TC intensity estimation [29, 49, 53]. Through the combined input of multi-channel satellite images, the CNN-I model can learn the complex nonlinear relationship between various elements and TC intensity. 

A variety of network configurations were tested further using the CNN-I 12 model, and the results are shown in Table 8. Four convolutional layers, four pooling layers, and two fully connected layers make up the CNN-I 13-16 models. For CNN-I 12 to CNN-I 15, the number of convolution kernels in the convolutional layers increases gradually. The model’s accuracy improves slightly as the number of convolution kernels increases (CNN-I 13), but it decreases dramatically beyond a certain range of kernel numbers (CNN-I 15). Although more kernels allow for the extraction of more feature maps, these maps do not always improve the model. In addition, the decrease in model performance with an increasing number of convolution kernels may also be associated with the number of training samples. More complex CNN models need to learn more parameters, and it is difficult to obtain a high accuracy when the number of model parameters is too large relative to the training sample size. Compared with the CNN-I 12 model, CNN-I 14 changes the step size of the convolution kernel operation, but the accuracy is reduced. 

Table 8 Results of the CNN-based TC intensity estimation model with different parameters 

| Model number | Parameters | Accuracy | RMSE (kt) |
|---|---|---|---|
| CNN-I 13 | C1(16@10×10), P1(3×3); C2(32@5×5), P2(3×3); C3(64@3×3), P3(3×3); C4(128@3×3), P4(3×3); Dropout = 0.5, FC1024, FC128 | 82.9% | 10.64 |
| CNN-I 12 | C1(32@10×10), P1(3×3); C2(64@5×5), P2(3×3); C3(128@3×3), P3(3×3); C4(256@3×3), P4(3×3); Dropout = 0.5, FC1024, FC128 | 84.8% | 10.19 |
| CNN-I 14 | C1(64@10×10), P1(3×3); C2(128@5×5), P2(3×3); C3(256@3×3), P3(3×3); C4(512@3×3), P4(3×3); Dropout = 0.5, FC1024, FC128 | 80.1% | 11.48 |
| CNN-I 15 | C1(32@10×10), P1(3×3); C2(64@5×5), P2(3×3); C3(128@4×4), P3(3×3); C4(256@3×3), P4(3×3); Dropout = 0.5, FC1024, FC128 | 71.8% | 12.59 |
| CNN-I 16 | C1(32@10×10), P1(3×3); C2(64@5×5), P2(3×3); C3(128@3×3), P3(3×3); C4(256@3×3), P4(3×3); Dropout = 0.5, FC1024, FC128 | 86.0% | 10.06 |

Recently, Woo et al. proposed spatial and channel attention mechanisms, which are based on the study of human vision [54]. As shown in Fig. 8, after adding the spatial and channel attention layers, the CNN-I 16 model gives the highest accuracy (86.0%) and the lowest RMSE (10.06 kt) of the MSW. With the attention mechanism, the CNN-based TC intensity estimation model can focus on the key factors, which helps to improve its accuracy. 

In general, the CNN-I 16 multi-category classification model does a good job of estimating TC intensity. However, as demonstrated in Tables 9 and 10, the classification results for TC categories with few training samples are not particularly satisfactory. For example, there are 62 samples in the H3 category, but only 48 have been identified correctly; only 77.0% of the H3 category is accurate. The degradation is caused by an imbalance in the training data across the TC categories. Taking H3 as an example, the H3 category accounts for only 2.3% of the total training samples. Even if most of the H3 images are misclassified during the training of the CNN-based TC intensity estimation model, the loss increases only slightly. Because the CNN adjusts the weights of each layer in response to the loss, the network struggles to learn the features of a category with few samples. As a result, the accuracy of the CNN-I model for the H3 category is lower. 

Fig. 8 Schematic diagram of the spatial and channel attention layers in the CNN-based TC intensity estimation model (a), channel attention (b), and spatial attention (c). “+” and “×” are plus and multiply signs, respectively 

We use the Focal_loss function to replace the original loss function in the CNN TC intensity estimation model. This function helps a model learn features by raising the weight in the loss of categories with fewer data, and it has performed well in the field of target recognition. In this way, the model can better learn the relevant attributes of TC categories with limited samples. The Focal_loss function is defined as follows [27]: 

$$FL(p_t) = -\alpha_t (1 - p_t)^{r} \log(p_t) \qquad (3)$$

where $p_t$ is the output of the model for the NC to H5 categories, $\alpha_t$ is the weight coefficient, which is determined by the proportion of the number of samples in each of the NC to H5 categories to the total data, and $r$ is an empirical parameter. The values of $\alpha_t$ and $r$ for each TC category used in this study are shown in Table 11. The accuracy of the TC intensity estimation model (CNN-I 17) using the Focal_loss function is improved to 86.6%, and the RMSE is reduced by 2.1%. 
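
To make Eq. 3 concrete, here is a minimal per-sample sketch in NumPy. It uses the $\alpha_t$ values of Table 11 (as fractions) and $r = 2$; the probability vector, the category order NC to H5, and the function name are assumptions for illustration, and this is not the training code used in the study.

```python
import numpy as np

def focal_loss(probs, target, alpha, r=2.0, eps=1e-12):
    """FL(p_t) = -alpha_t * (1 - p_t)**r * log(p_t) for one sample."""
    p_t = float(np.clip(probs[target], eps, 1.0))   # predicted probability of the true class
    return -alpha[target] * (1.0 - p_t) ** r * np.log(p_t)

# alpha_t from Table 11, in the order NC, TD, TS, H1, H2, H3, H4, H5
alpha = np.array([97, 70, 60, 86, 94, 95, 97, 99.7]) / 100.0
probs = np.array([0.02, 0.03, 0.05, 0.10, 0.20, 0.50, 0.08, 0.02])  # made-up model output

easy = focal_loss(probs, target=5, alpha=alpha)   # true class H3 predicted with p = 0.50
hard = focal_loss(probs, target=7, alpha=alpha)   # true class H5 predicted with p = 0.02
```

The factor $(1 - p_t)^r$ shrinks the loss of well-classified samples, so the poorly classified sample dominates the gradient.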

In many domains, multi-class classification can be transformed into a target recognition or binary classification problem [18]. In this study, we further used eight binary models to replace the multi-classification model, so that each binary model learns the TC characteristics of one intensity category. Eight CNN-based TC intensity estimation binary models were constructed to identify the NC to H5 categories, respectively. 

The configuration of each model in Table 12 is the same as that of CNN-I 16. The Focal_loss with the values of $\alpha_t$ and $r$ listed in Table 11 was also used. We changed the classification label to “1” or “0”, indicating whether the intensity of a TC corresponds to a particular category. If the maximum sustained wind speed of a TC sample is 52.0 kt, which belongs to the TS category, the corresponding label of this image is “1” in the binary model responsible for judging the TS category and “0” in the other binary models. 

Table 9 Confusion matrix of the CNN-I 16 multi-classification model (rows: model category; columns: actual category) 

NC H1 H2 H3 H4 H5 Total 
NC 22 0 0 0 0 0 25 
TD 11 1 0 0 0 0 390 
TS 0 18 3 1 1 0 549 
H1 0 145 14 2 2 0 194 
H2 0 12 58 6 0 0 76 
H3 0 2 8 48 4 0 64 
H4 0 0 2 5 29 1 37 
H5 0 0 0 0 0 3 3 
Total 35 178 85 62 36 4 1338 

Table 10 Classification report of the CNN-I 16 multi-classification model 

| TC category | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| NC | 63 | 88 | 73 |
| TD | 90 | 89 | 89 |
| TS | 89 | 90 | 89 |
| H1 | 81 | 74 | 78 |
| H2 | 68 | 76 | 72 |
| H3 | 77 | 75 | 76 |
| H4 | 80 | 78 | 79 |
| H5 | 75 | 100 | 85 |
| Total | 86 | 86 | 86 |

Table 11 Values of $\alpha_t$ and $r$ in the Focal_loss function 

| TC category | NC | TD | TS | H1 | H2 | H3 | H4 | H5 |
|---|---|---|---|---|---|---|---|---|
| r | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| α_t (%) | 97 | 70 | 60 | 86 | 94 | 95 | 97 | 99.7 |

As shown in Table 12, the CNN-I 18 binary model performs much better than the multi-classification model CNN-I 16. Compared with CNN-I 17, the accuracy of CNN-I 18 is improved to 88.9%, and the RMSE is reduced to 8.99 kt. The results show that introducing the Focal_loss function and transforming the multi-classification model into eight binary classification models helps to reduce the side effects caused by the imbalanced dataset. 

Table 12 Results of the CNN-based TC intensity estimation multi-classification models with/without the Focal_loss function and of the binary model 

| Model number | Method | Accuracy (%) | RMSE (kt) |
|---|---|---|---|
| CNN-I 16 | CNN multi model | 86.0 | 10.06 |
| CNN-I 17 | CNN multi model with Focal_loss | 86.6 | 9.84 |
| CNN-I 18 | CNN binary model with Focal_loss | 88.9 | 8.99 |

Table 13 Confusion matrix of the CNN-I 18 binary classification model 


MODEL ACTUAL CATEGORY 
CATEGORY 

NC TD TS H1 H2 H3 H5 Total 
NC 24 2 0 0 0 0 0 26 
TD 11 354 24 2 0 0 0 391 
TS 0 25 511 14 1 0 0 551 
H1 0 2 14 152 12 3 0 183 
H2 0 2 2 7 62 4 0 79 
H3 0 1 3 9 52 0 70 
H4 0 0 0 1 3 0 35 
H5 0 0 0 0 0 4 4 
Total 35 386 552 178 85 62 4 1338 

The CNN-I 18 binary classification model’s confusion matrix and classification report are shown in Tables 13 and 14. The numbers of exact hits for NC, TD, TS, H1, H2, H3, and H4 all rose compared with the results from the multi-classification model (Tables 9 and 10). The precisions of the H1-H4 classifications have improved by 4.9%, 7.4%, 9.1%, and 3.8%, respectively. 

Table 15 compares the performance of the CNN model with other TC intensity estimation methods. The RMSE of the maximum wind speed estimated by the CNN-based TC intensity estimation model proposed in this study is smaller than that of the DAVT technique and of most CNN regression or classification models, which demonstrates the potential of the CNN method in TC monitoring. 

Table 14 Classification report of the CNN-I 18 binary classification model 

| TC category | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| NC | 69 | 92 | 79 |
| TD | 92 | 91 | 91 |
| TS | 93 | 93 | 93 |
| H1 | 85 | 83 | 84 |
| H2 | 73 | 78 | 76 |
| H3 | 84 | 74 | 79 |
| H4 | 83 | 86 | 85 |
| H5 | 100 | 100 | 100 |
| Total | 89 | 89 | 89 |

Table 15 Comparison of our model and other methods 

| Literature | RMSE (kt) |
|---|---|
| Kossin et al. [25] | 13.2 |
| Ritchie et al. [39] | 12.7 |
| Fetanat et al. [13] | 12.7 |
| Pradhan et al. [36] | 10.2 |
| Chen et al. [3] | 10.6 |
| Tian et al. [45] | 8.9 |
| Our model | 8.9 |


5 Summary 


Accurately locating the TC center and estimating its intensity is an essential step for forecasters and emergency responders in issuing disaster warnings. In this chapter, a set of CNN-based models has been developed to automatically identify the TC center (CNN-L model) and intensity (CNN-I model) from H-8 geostationary satellite IR imagery, which can provide reliable technical and information support for TC prediction and early warning systems. 

Results show that the selection of satellite image channels has a significant impact on the performance of the TC intensity estimation model but hardly affects the TC center location model. Network parameters play an essential role in both models. The mean distance between the TC centers identified by the CNN-L model and those in the Best Track dataset is 30 km for TCs in categories H1-H5. The accuracy of our CNN-L model is comparable to that of some techniques that locate a TC center based on its morphological features in IR images. Using four-channel (Channels 7, 8, 13, and 15) IR imagery, we found that the CNN-I 16 model has the best performance among the multi-classification models. 

For TC categories with smaller training datasets, the multi-classification model cannot produce a very good result because of the unbalanced distribution of TC categories. By introducing the Focal_loss function in the CNN model and adopting eight binary classification networks, the side effect of the unbalanced training data is reduced. In TC intensity estimation, the binary classification model CNN-I 18 gives a substantially lower RMSE (8.99 kt) of the maximum wind speed than the multi-classification model. 
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and Bin Zhang 


1 Introduction 


The ocean plays a vital role in regulating global climate change. About 30% of total emissions since the pre-industrial period have been stored in the ocean, and about 50% of the oceanic uptake of anthropogenic carbon takes place in the Southern Ocean. Because it dominates global heat and carbon dioxide absorption, many scientists regard the Southern Ocean as a main research region. The “Southern Ocean” (< 35°S) was proposed by scientists around 2000 and was determined to be the fifth largest ocean in the world. It is the only ocean that completely surrounds the Earth and is not divided by continents. Its main current, the Antarctic Circumpolar Current (ACC), differs in important ways from the ocean currents in the Pacific, Indian, and Atlantic Oceans. Moreover, the Southern Ocean is also an important region for global carbon absorption and release. Before industrial times, the Southern Ocean was a major carbon source region owing to the influence of upwelling [6]. Under the influence of human activities, the atmospheric pressure gradient shifted, and the region turned into a carbon sink. In the following sections, we use the SOCAT dataset to build a feedforward neural network (FFNN). Based on this network, we reconstruct the Southern Ocean pCO2 data and calculate the CO2 flux changes in the region. Compared with two other neural network algorithms, our algorithm has a smaller root-mean-square error. 


1.1 Observations of pCO2 in Southern Ocean 


Many data of the carbonate system can only be obtained by in-situ measurement. Due to the harsh environment of the Southern Ocean, observations there are sparse. For sea surface data, through the continuous efforts of scientists, the Surface Ocean CO2 Atlas [13] has compiled and quality-controlled ship data, fixed-point observation data, and drifting buoy data to form a relatively complete observational data set (Fig. 1). This data set contains the pCO2 data that can be used to calculate the sea-air carbon dioxide flux. We use this database as the truth value to construct our neural network and reconstruct gridded pCO2 data for the entire Southern Ocean. 


Fig. 1 1998-2018 SOCAT data observation heat map (color scale: observation data number; data min = 1.0, max = 315693.0) 


1.2 Comparison of Reconstruction pCO2 Data 


The results obtained by some traditional atmospheric inversion algorithms are greatly affected by the amount of observational data [17, 20]. Some spatial and temporal interpolations are based on empirical relationships between carbon dioxide and proxy variables, and are mainly concentrated in areas with relatively rich observations. 

Neural network and other machine learning approaches have been frequently used in the reconstruction of surface pCO2 in recent years. To reconstruct the pCO2 data of the Southern Ocean, Gregor et al. employed a support vector machine (SVM) and a random forest (RF); the root-mean-square errors (RMSEs) were 16.45 µatm and 24.04 µatm, respectively. Meanwhile, Landschützer et al. [11] created the SOM-FFNN method by combining a self-organizing map (SOM) with a feedforward neural network (FFNN) to reconstruct pCO2 data for the Southern Ocean. Sea surface temperature (SST), sea surface salinity (SSS), mixed layer depth (MLD), chlorophyll concentration (CHL), and other variables are used as inputs. The study shows that during 1980-2000 the Southern Ocean carbon sink remained stagnant or even weakened, and that it increased again after 2002. Both data products showed good interannual and seasonal cyclical changes, but compared with the traditional machine learning algorithms (SVM and RF), SOM-FFNN shows better performance. Denvil-Sommer et al. [3] employed the Laboratory of Climate and Environmental Sciences (LSCE)-FFNN method to reconstruct global pCO2 data, which maintained consistency with observational results. However, compared with the observed data, the reconstructed data for the Southern Ocean have a larger error than those for other regions with more in situ observations. 

In this chapter, we use the Surface Ocean CO2 Atlas (SOCAT V.6) data from 1998 to 2018 in the Southern Ocean and apply the CA-FFNN method to reconstruct monthly 1° × 1° pCO2 data for the Southern Ocean. Because the FFNN produces more stable data in sparse areas [20] and interpolates the data with small deviation [12], we use this method to reconstruct the Southern Ocean regional data. The procedure is separated into two parts. First, each parameter’s correlation index is calculated and ranked. Second, the pCO2 data in the unobserved areas of the Southern Ocean are interpolated using a relational model that employs the parameters with reasonably strong correlation coefficients as input variables of the FFNN. This strategy improves the current situation in which stations with fewer observations have larger RMSE values. As a result, the method might be used to reconstruct regional data. Finally, we examine pCO2 fluctuations in the Southern Ocean on seasonal, interannual, and interdecadal scales. 


2 Data and Methods 
2.1 Data 


The parameters used in the CA method included SST, SST anomaly (SSTA), SSS, and SSS anomaly (SSSA); these parameters were all from the gridded dataset of Global Ocean Heat Content Change [2], while the anomalies were obtained by subtracting the monthly climatological mean from the data values. Chlorophyll concentration (Chl-a) was based on satellite remote sensing data from the European Space Agency’s Global Color Project, while MLD data were obtained from the French Institute of Marine Development. The u- and v-components of the wind field at 10 m above sea level (a.s.l.) were taken from the European Centre for Medium-Range Weather Forecasts. All these data except MLD are monthly averages over a 1° × 1° latitude/longitude box; the MLD data are monthly averages over 0.5° × 0.5°. 

In this chapter, we convert the fCO2 data in the SOCAT data set to pCO2 data, which serve as the training and test sets of the FFNN. The relationship between fCO2 and pCO2 is as follows [10]: 


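$$fCO_2 = pCO_2 \cdot \exp\left(p\,\frac{B + 2\delta}{R \times T_{subskin}}\right) \qquad (1)$$

where $p$ is the atmospheric pressure (Pa), $R$ is the gas constant (8.314 J K⁻¹ mol⁻¹), SST is the sea surface temperature (K), $T_{subskin}$ is the subskin temperature, and $B$ and $\delta$ are the correction coefficients, which are calculated as: 

$$T_{subskin} = SST + 0.17 \qquad (2)$$

$$B\left(\frac{\mathrm{m}^3}{\mathrm{mol}}\right) = \left(-1636.75 + 12.0408\,SST - 3.27957 \times 10^{-2}\,SST^2 + 3.16528 \times 10^{-5}\,SST^3\right) \times 10^{-6} \qquad (3)$$

$$\delta\left(\frac{\mathrm{m}^3}{\mathrm{mol}}\right) = \left(57.7 - 0.118\,T_{subskin}\right) \times 10^{-6} \qquad (4)$$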
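
Eqs. 1-4 can be chained into one small conversion routine. The sketch below follows the equations as written (SST in Eq. 3, the subskin temperature in Eqs. 1 and 4); the function name, the default pressure of 101325 Pa, and the example input are assumptions for illustration only.

```python
import math

R_GAS = 8.314  # gas constant, J K^-1 mol^-1

def pco2_to_fco2(pco2_uatm, sst_k, pressure_pa=101325.0):
    """Convert pCO2 to fCO2 following Eqs. 1-4 (temperatures in kelvin)."""
    t_subskin = sst_k + 0.17                                    # Eq. 2
    b = (-1636.75 + 12.0408 * sst_k - 3.27957e-2 * sst_k ** 2
         + 3.16528e-5 * sst_k ** 3) * 1e-6                      # Eq. 3, m^3 mol^-1
    delta = (57.7 - 0.118 * t_subskin) * 1e-6                   # Eq. 4, m^3 mol^-1
    exponent = pressure_pa * (b + 2.0 * delta) / (R_GAS * t_subskin)
    return pco2_uatm * math.exp(exponent)                       # Eq. 1

# Example: 380 uatm at an SST of 10 degrees C (283.15 K)
fco2 = pco2_to_fco2(380.0, sst_k=283.15)
```

Because the correction term is small and negative at typical sea surface temperatures, fCO2 comes out a fraction of a percent below pCO2.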


The partial pressure of atmospheric CO2 was calculated by the following formula [14]: 

$$pCO_{2a} = xCO_2\,[P_{eq} - VP(H_2O)] \qquad (5)$$

where $xCO_2$ is the dry air mixing ratio of atmospheric CO2. The relevant data are collected from the marine boundary layer reference data of the Earth System Research Laboratory of the National Oceanic and Atmospheric Administration (NOAA). Additionally, $P_{eq}$ is the pressure at equilibrium, and $VP(H_2O)$ is the water vapor pressure of seawater at a given temperature [8]: 

$$VP = 0.61121 \times \exp\left[\left(18.678 - \frac{T_{subskin}}{234.5}\right) \times \frac{T_{subskin}}{257.14 + T_{subskin}}\right] \qquad (6)$$

where $T_{subskin}$ is the subskin temperature. 
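
A short sketch of Eqs. 5 and 6 follows. It assumes that the temperature in Eq. 6 is given in degrees Celsius and that the result is in kPa, which is the usual convention for a formula with these coefficients; the chapter does not state the units explicitly. The function names and example inputs are hypothetical.

```python
import math

def vapor_pressure_kpa(t_celsius):
    """Eq. 6: water vapor pressure (assumed in kPa, temperature in deg C)."""
    return 0.61121 * math.exp((18.678 - t_celsius / 234.5)
                              * (t_celsius / (257.14 + t_celsius)))

def atmospheric_pco2(xco2_ppm, p_eq_kpa, t_celsius):
    """Eq. 5: pCO2a = xCO2 * [P_eq - VP(H2O)], returned in uatm."""
    # ppm times a pressure in atm gives uatm; 101.325 kPa = 1 atm.
    return xco2_ppm * (p_eq_kpa - vapor_pressure_kpa(t_celsius)) / 101.325

vp = vapor_pressure_kpa(20.0)
pco2a = atmospheric_pco2(400.0, 101.325, 20.0)
```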

To reduce the computational complexity that a very large data set imposes on neural network learning, we use Eq. 7 to normalize all data: 

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (7)$$

where $x$ is the actual value, $\min(x)$ is the minimum value of $x$, and $\max(x)$ is the maximum value of $x$. 

Since the Chl-a data in this study did not include relevant records before the launch of SeaWiFS in 1997, our research period was from 1998 to 2018. The spatial resolution of all parameter data was 1° × 1°. Longitude (Lon) and latitude (Lat) are in 360° and 180° coordinate systems, and trigonometric conversion functions were used to ensure continuity and normalization. 
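
The two preprocessing steps can be sketched as follows. `min_max` implements Eq. 7 directly. The chapter does not specify which trigonometric functions were applied to the coordinates, so `encode_position` is only one plausible choice, shown here as an assumption.

```python
import numpy as np

def min_max(x):
    """Eq. 7: scale x to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def encode_position(lon_deg, lat_deg):
    """One possible trigonometric encoding of position (an assumption).

    Longitude is cyclic, so its sine and cosine keep 0 and 360 degrees
    adjacent; latitude is mapped with a sine so values stay in [-1, 1].
    """
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    return np.sin(lon), np.cos(lon), np.sin(lat)

sst_scaled = min_max([271.0, 280.0, 289.0])
sin_lon, cos_lon, sin_lat = encode_position([0.0, 360.0], [-35.0, -70.0])
```

With this encoding, the grid cells on either side of the 0°/360° meridian receive nearly identical inputs, which avoids an artificial discontinuity in the reconstructed field.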


2.2 Nonlinear Neural Network Model for the pCO2 
Reconstruction in the Southern Ocean 


We use Eqs. 8 and 9 to calculate the correlation coefficients and build a covariance matrix between pCO2 and the other collected data, as shown in Fig. 3. 

$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \qquad (8)$$

$$\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \qquad (9)$$

where $\mu$ is the mean of the variable, $\sigma$ is the standard deviation of the variable, $Cov(X, Y)$ is the calculated covariance, and $\rho$ is the correlation coefficient. 

We use the parameters with correlation coefficients > 0.1 as the input parameters. Considering the relevance of chemical effects between SST and pCO2 [18], we still use SST as an input parameter. After correlation analysis, the selected parameters were SST, SSSA, MLD, CHL, the u-component (U) of the sea surface wind field, and the partial pressure of atmospheric CO2 (pCO2a). The established relationship between pCO2 and the main parameters is summarized in Eq. 10. 

$$pCO_2 = F(SST, SSSA, CHL, MLD, U, aCO_2, Lon, Lat) \qquad (10)$$
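
The selection step can be illustrated with synthetic data. The sketch below computes the correlation coefficient of Eq. 9 with `np.corrcoef` and keeps the variables whose coefficient exceeds the 0.1 threshold; applying the threshold to the magnitude of the coefficient is an assumption, and the variable names and data are made up for the example.

```python
import numpy as np

def select_inputs(features, target, threshold=0.1):
    """Return the variables whose |correlation| with the target exceeds threshold."""
    selected = {}
    for name, values in features.items():
        rho = np.corrcoef(values, target)[0, 1]   # Eq. 9: Cov / (sigma_X * sigma_Y)
        if abs(rho) > threshold:
            selected[name] = rho
    return selected

rng = np.random.default_rng(0)
pco2 = rng.normal(360.0, 10.0, 500)                   # synthetic target
features = {
    "MLD": 0.5 * pco2 + rng.normal(0.0, 5.0, 500),    # synthetic, correlated with target
    "noise": rng.normal(0.0, 1.0, 500),               # synthetic, unrelated to target
}
chosen = select_inputs(features, pco2)
```

A variable such as SST can still be added to the input list by hand after this step, as the chapter does on physical grounds.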


Fig. 2 Structure of our FFNN with Tanh activation. The gray square is the dropout layer (dropout rate 0.5), the blue square is the input layer, the yellow square is the hidden layer, and the green square is the output layer 

Fig. 3 Matrix of correlation coefficients. Each colored box represents the correlation coefficient between the x-axis and y-axis parameters. The correlation coefficients of pCO2 with the other parameters are contained within the blue box 

A nonlinear regression model was built using the FFNN. Although an FFNN’s output improves and becomes more accurate as the number of layers and neurons grows, the appropriate model size is also determined by the amount of data available for training. Because there is less observational data for the Southern Ocean than for other regions, we built a simple FFNN, the structure of which is shown in Fig. 2. The final model at Step 2 has eight layers (six hidden layers), and the numbers in the figure represent the size of the tensor input to each layer. A gray square represents a dropout layer, and the dropout rate is 0.5. The hyperparameters of the neural network were determined using k-fold cross-validation (Fig. 4). 
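
To show the shape of such a network, the following NumPy sketch runs a forward pass through an input layer, six Tanh hidden layers, and a linear output layer, with dropout at a rate of 0.5. The layer widths in `SIZES`, the positions of the three dropout layers, and the random weights are assumptions for illustration; the chapter's actual sizes are given only in Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical layer widths: 8 inputs, six hidden layers, 1 output.
SIZES = [8, 64, 64, 32, 32, 16, 16, 1]
weights = [rng.normal(0.0, np.sqrt(1.0 / m), (m, n))
           for m, n in zip(SIZES[:-1], SIZES[1:])]
biases = [np.zeros(n) for n in SIZES[1:]]
DROPOUT_AFTER = {1, 3, 5}   # hypothetical positions of the three dropout layers

def forward(x, training=False, rate=0.5):
    """Forward pass: Tanh hidden layers, optional inverted dropout, linear output."""
    h = np.asarray(x, dtype=float)
    last = len(weights) - 1
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < last:
            h = np.tanh(h)
            if training and i in DROPOUT_AFTER:
                mask = rng.random(h.shape) >= rate    # keep each unit with probability 1 - rate
                h = h * mask / (1.0 - rate)
    return h

batch = rng.random((4, 8))       # four samples of normalized inputs
prediction = forward(batch)      # shape (4, 1), untrained weights
```

Dropout is active only when `training=True`, so predictions at inference time are deterministic.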

Fig. 4 k-fold cross-validation, which was divided into four folds in this study, with 25% of the data for testing and the rest for training to create the best neural network. The yellow shape represents test data, whereas the blue shape represents training data 

The data were divided into 75%/25% portions for the training/testing sets. The neural network consists of eight layers, with six fully connected hidden layers in the middle. We added three dropout layers, each with a dropout ratio of 0.5, to prevent the FFNN from overfitting. Through many tests and detailed analyses, the hyperbolic tangent (Tanh) was selected as the activation function of the neurons, and the mean squared error (MSE) was used as the loss function: 

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (observed_i - predicted_i)^2 \qquad (11)$$

where $observed_i$ is the observation data and $predicted_i$ is the data predicted by the FFNN model. We used RMSProp as the optimization function [21]. 

To control the amount of information, we adjusted the adaptive learning rate. The CA-FFNN was then formed by combining a main factor analysis with the FFNN: based on the selected parameters, we built an FFNN structure and obtained a nonlinear regression model through training. 
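
The four-fold split and the loss of Eq. 11 can be sketched as below. The shuffling seed and the function names are assumptions; in practice each train/test pair would be used to fit and score one candidate set of hyperparameters.

```python
import numpy as np

def mse(observed, predicted):
    """Eq. 11: mean squared error."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))

def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 1/k of the data."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(100, k=4))        # four 75%/25% splits
error = mse([350.0, 360.0], [352.0, 357.0])    # (4 + 9) / 2
```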


2.3 Calculation of Carbon Dioxide Flux in the Southern 
Ocean 


The formula for calculating the carbon dioxide flux at the air-sea interface is [29]: 

$$F = K \cdot \Delta fCO_2 = K \cdot (\alpha_{subskin}\, fCO_{2w} - \alpha_{skin}\, fCO_{2a}) \qquad (12)$$

where $\alpha$ is the solubility of CO2 in seawater (mol kg⁻¹ atm⁻¹), calculated following Weiss [10]: 

$$\ln \alpha = -60.2409 + 93.4517 \times \frac{100}{T_{subskin}} + 23.3585 \times \ln\frac{T_{subskin}}{100} + S \times \left[0.023517 - 0.023656 \times \frac{T_{subskin}}{100} + 0.0047036 \times \left(\frac{T_{subskin}}{100}\right)^2\right] \qquad (13)$$

In Eq. 12, $\alpha_{subskin}$ is calculated from the subskin temperature and $\alpha_{skin}$ from the skin temperature, $fCO_{2w}$ is the fugacity of CO2 in subskin seawater, $fCO_{2a}$ is the fugacity of atmospheric CO2, and $K$ is the exchange rate, which is usually considered a function of wind speed. In Eq. 13, $S$ is the salinity. 

$$K = \Gamma\,(660/Sc)^{0.5}\,U^2 \qquad (14)$$

Here, $Sc$ is the Schmidt number of CO2 in seawater at a given $T_{subskin}$, such that: 

$$Sc = 2073.1 - 125.62 \times T_{subskin} + 3.6276 \times T_{subskin}^2 - 0.043219 \times T_{subskin}^3 \qquad (15)$$

where $U$ is the monthly mean wind speed (m s⁻¹) at 10 m height from the cross-calibrated multi-platform ocean surface wind vector analysis product, and $\Gamma$ is the scale factor, which has been evaluated for different wind speed products (e.g., 0.39, 0.251, 0.31) and used in other studies [14, 24, 28]. Based on an average wind speed of 6.38 m s⁻¹ in the ECMWF product, a scale factor of 0.31 was used to reach a global mean transfer velocity of 16 cm h⁻¹, consistent with the new radiocarbon-based constraints. 
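
The pieces of Eqs. 12-15 fit together as in the sketch below. It assumes that the temperature in the solubility formula is in kelvin, that the Schmidt number polynomial takes degrees Celsius, and that K is in cm h⁻¹ for U in m s⁻¹; no unit conversion to Pg C yr⁻¹ is attempted. Function names and example inputs are hypothetical.

```python
import math

def solubility(t_kelvin, salinity):
    """Eq. 13: CO2 solubility in mol kg^-1 atm^-1 (temperature in kelvin)."""
    t100 = t_kelvin / 100.0
    ln_a = (-60.2409 + 93.4517 / t100 + 23.3585 * math.log(t100)
            + salinity * (0.023517 - 0.023656 * t100 + 0.0047036 * t100 ** 2))
    return math.exp(ln_a)

def schmidt_number(t_celsius):
    """Eq. 15: Schmidt number of CO2 in seawater (temperature assumed in deg C)."""
    return (2073.1 - 125.62 * t_celsius + 3.6276 * t_celsius ** 2
            - 0.043219 * t_celsius ** 3)

def transfer_velocity(u10, t_celsius, gamma=0.31):
    """Eq. 14: K = gamma * (660 / Sc)**0.5 * U**2."""
    return gamma * math.sqrt(660.0 / schmidt_number(t_celsius)) * u10 ** 2

def flux(k, alpha_subskin, fco2_w, alpha_skin, fco2_a):
    """Eq. 12: F = K * (alpha_subskin * fCO2w - alpha_skin * fCO2a), unconverted units."""
    return k * (alpha_subskin * fco2_w - alpha_skin * fco2_a)

alpha = solubility(298.15, 35.0)
sc = schmidt_number(20.0)
k = transfer_velocity(6.38, 20.0)
f = flux(k, alpha, 360.0, alpha, 400.0)   # seawater fCO2 below atmospheric: negative flux
```

A negative result corresponds to uptake by the ocean, in line with the sign convention used for the carbon sink later in the chapter.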


2.4 Evaluation 


Due to the limited observation data in the Southern Ocean, the data set available for verification is very small, so splitting the data set leads to large differences in RMSE and mean absolute error (MAE). To ensure reliable model verification, we used 100% of the data to train, test, and verify the model, and continuously optimized the neural network model and its internal weights. Finally, the neural network was used to predict the observed area. The RMSE is 8.86 µatm, while the MAE is 5.01 µatm. 

Figure 5 shows that the predicted values are very close to the observed values, with R² = 0.93. In Table 1, we list the RMSE and MAE between the results of different algorithms and the actual values. SOM-FFNN merges a self-organizing map (SOM) with a feedforward neural network, and its RMSE is 12.24. LSCE-FFNN is the FFNN method of the Laboratory of Climate and Environmental Sciences, and its RMSE is 17.40. We conclude that the CA-FFNN-based model outperforms both SOM-FFNN and LSCE-FFNN. 

Fig. 5 Scatter fit of product data and observation data at the same stations 

Table 1 Comparison of our algorithm’s errors with those of LSCE-FFNN and SOM-FFNN 

| Artificial intelligence algorithm | RMSE | MAE |
|---|---|---|
| FFNN for Southern Ocean | 8.86 | 5.01 |
| LSCE-FFNN [3] | 17.40 | 11.92 |
| SOM-FFNN [12] | 12.24 | 7.36 |
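
For reference, the three scores reported above can be computed as in this minimal sketch. R² is taken here as the coefficient of determination, which is one common definition; the chapter does not state which definition it uses, and the four data pairs are made up.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return RMSE, MAE and the coefficient of determination R^2."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = pred - obs
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2))
    return rmse, mae, r2

rmse, mae, r2 = evaluate([350.0, 360.0, 370.0, 380.0],
                         [352.0, 358.0, 371.0, 379.0])
```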


3 Results and Discussion 


3.1 Seasonal Variation in Southern Ocean Sea Surface pCO2 


According to the new dataset, pCO2 changes periodically with the seasons. This result is consistent with the seasonal changes found in other studies [16, 25, 27]. The seasonal mean amplitude of ocean surface pCO2 in the Southern Ocean was 13.02 µatm, and our data show seasonal variation characteristics similar to those of observational data from the Southern Ocean [15], with pCO2 reaching its minimum in summer and increasing in winter (Fig. 6). Driven by both biological and physical factors, pCO2 in the Southern Ocean shows obvious seasonal changes [22]. In winter, due to the enhancement of the wind field in the Southern Ocean (Fig. 7), the Ekman transport caused by the wind field also intensifies [1, 7], strengthening upwelling and improving the efficiency of the biological pump. 

The dissolved inorganic carbon in the bottom layer migrates to the surface layer under the influence of the upwelling, making the surface pCO2 increase continuously. With the melting of sea ice in the Southern Ocean in summer, marine primary productivity gradually recovers, the Chl-a concentration increases (Fig. 8), and CO2 in seawater is absorbed through photosynthesis [26], which leads to a decrease in surface pCO2. The change in this period is mainly due to biological factors. 

Fig. 7 Normalized monthly mean U-component of wind and pCO2 from 1998 to 2018 

Fig. 8 Normalized monthly mean CHL and pCO2 from 1998 to 2018 


3.2 Annual Variation in Southern Ocean Sea Surface pCO2 


Fig. 9 (a) Monthly fluctuations in the Southern Ocean’s pCO2 (µatm) from 1998 to 2018; (b) yearly fluctuations in the Southern Ocean’s pCO2 (µatm) from 1998 to 2018 

Fig. 10 The rate of change in the surface Southern Ocean’s pCO2 (µatm yr⁻¹) concentration 

Analyzing the interannual change of the reconstructed pCO2 data from 1998 to 2018, the mean surface pCO2 of the Southern Ocean increased from 351.88 µatm to 372.65 µatm, a total increase of 20.77 µatm in 21 years and an annual mean increase of 0.99 µatm yr⁻¹. As shown in Fig. 9, the Southern Ocean pCO2 has maintained a high growth rate. 

By calculating the linear rate of change at each location in the Southern Ocean over the 21-year period, we find that pCO2 is gradually increasing in most areas, as shown in Fig. 10. The growth rate between 35° and 55°S is faster than in other regions. Since 2002, many studies have shown that pCO2 in the Southern Ocean has maintained a high growth rate [23], and our data also show this trend. 
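
A per-grid-cell linear rate of change of the kind mapped in Fig. 10 can be obtained with a least-squares fit along the time axis. The sketch below is a minimal illustration on a synthetic (time, lat, lon) array; the array layout, the function name, and the data are assumptions.

```python
import numpy as np

def linear_trend(cube, years):
    """Least-squares slope per grid cell for a (time, lat, lon) array."""
    t, ny, nx = cube.shape
    x = years - np.mean(years)                       # center time for numerical stability
    slopes = np.polyfit(x, cube.reshape(t, ny * nx), 1)[0]
    return slopes.reshape(ny, nx)

years = np.arange(1998, 2019)                        # 21 annual values
# Synthetic field: every cell rises by 1 uatm per year from a different baseline.
base = np.linspace(340.0, 370.0, 6).reshape(2, 3)
cube = base[None, :, :] + 1.0 * (years - years[0])[:, None, None]
trend = linear_trend(cube, years)                    # uatm per year in each cell
```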


3.3 Variability in Sea-Air CO2 Flux 


As for the rate of change of ΔfCO2, most of the Southern Ocean is transforming into a carbon sink area. The black/red dots in Fig. 11 represent regions where ΔfCO2 shows positive/negative trends with a high rate of change. According to the distribution of pCO2 in the Southern Ocean since 1998, the status of the inner ring (50-70°S) as a carbon source is changing, while the outer ring (35-50°S) has always maintained a strong carbon sink state with no tendency to weaken. The changes in CO2 flux in the Southern Ocean calculated by our model are consistent with other models in terms of the evolution of intensity [19]. Using Eq. 12 to calculate the CO2 flux, we found that the Southern Ocean’s CO2 flux has changed substantially over the past two decades. The ΔfCO2 in the Southern Ocean also changes regularly with the seasons, being strongest in early summer and weakest at the end of winter (Fig. 13). Many studies have shown that in the early 1990s the Southern Ocean was saturated with carbon and that it regained its vitality at the beginning of the 21st century [4]. The data product reproduces the strong increase of the carbon sink in the Southern Ocean since the beginning of the 21st century (Fig. 14). 

Fig. 11 Rate of change in the ΔfCO2 of the Southern Ocean. Carbon sink when ΔfCO2 < 0, carbon source when ΔfCO2 > 0 

In terms of interannual changes, the carbon sink of the Southern Ocean increased from -0.21 Pg C yr⁻¹ in 1998 to -1.67 Pg C yr⁻¹ in 2018. 

One standard deviation was used as an indicator of error: 

$$\sigma = \sqrt{\frac{\sum_{i}(x_i - \bar{x})^2}{n}} \qquad (16)$$

where $x_i$ is the actual value, $\bar{x}$ is the mean value of $x$, and $n$ is the number of data; the error range was within ± 0.0.087 Pg C yr⁻¹. 

Fig. 12 CO2 flux trends in the Southern Ocean from 1998 to 2018 

We found that the carbon sink in the Southern Ocean did not always maintain a
trend of rapid growth. During 2010-2013, the carbon sink stagnated. As shown in
Fig. 12, a similar phenomenon appears in many other reconstructed data products [5]. Many
studies have shown that changes in the Southern Annular Mode (SAM) led to the
stagnation of the carbon sink in the 1990s [5]. However, the stagnation was not strongly
correlated with the SAM. Stability during this period was mainly due to the weakening
of the carbon sink intensity between 35°S and 50°S. Changes in this region have also
been attributed to the barometric asymmetry of the zonal wave 3 (ZW3) pattern
[9]. Models that rely on observational data have difficulty capturing such large
and subtle interannual changes.

As shown in Fig. 16, there is an obvious double-ring structure before 2010, in which
the Southern Ocean is not always a carbon sink. The inner ring (50-70°S) changes with
the seasons. From April to September, the region serves as a carbon source,
emitting CO2 into the atmosphere. From October to March, it absorbs CO2, as shown
in Fig. 14. The outer ring (35-50°S) is the main carbon sink region (Fig. 15) and
accounts for most of the CO2 absorption. From
the perspective of the interannual changes in the entire region, the Southern Ocean
carbon dioxide flux is shifting toward a carbon sink.

However, with the strengthening of the carbon sink in the outer ring and the weakening
of the carbon source in the inner ring, this ring structure has been gradually
disappearing since 2010. As shown in Fig. 16, most Southern Ocean regions have become
carbon sink regions because the ΔfCO2 in the Southern Ocean has decreased significantly
since 1998.

Fig. 13 Changes in ΔfCO2 values by month and year from 1998 to 2018 (μatm). The gray lines

Fig. 16 In the Southern Ocean, mean sea surface CO2 fluxes (Pg C) were measured in 1998, 2003,
2006, 2010, 2014, and 2018


4 Conclusion 


In this chapter, we propose a feedforward neural network (FFNN) for reconstructing pCO2
data in the Southern Ocean that is generalizable to reconstructing other regional data.
The reconstruction process consists of two steps. First, we collect all parameters
that may have an impact on pCO2 from the literature and experimental data and obtain
the covariance matrix of the variables by calculation. The parameters with higher
correlation coefficient values and an effect on the process of pCO2 change were kept
as inputs to the FFNN. Second, after continuous and iterative calculation and
optimization, the final model was constructed and used to reconstruct the
pCO2 data of the Southern Ocean with a monthly temporal resolution and a spatial
resolution of 1° x 1°.

First, we identify the key parameters that affect pCO2 changes in the Southern Ocean.
Second, we use the advantages of neural network technology to interpolate
in data-sparse areas and build a new model by filtering parameters. Finally, in the
Southern Ocean, we compare the new data with the measured data and obtain a root
mean square error of 8.86 μatm, which is better than that of data reconstructed from
global data.

The results of our reconstruction demonstrate that pCO2 in the Southern Ocean’s
surface layer varies seasonally and has risen since 2000. It did, however, stall
from 2010 to 2013, after which it resumed its upward trend. In the Southern Ocean,
carbon dioxide flux is distributed in a double-ring shape. The primary carbon sink
region is 35-50°S; south of 50°S, carbon sources and sinks alternate seasonally.
Although our findings are consistent with earlier studies, the reconstructed
surface pCO2 products require ongoing verification. Our model will improve as the
frequency and range of observations in the Southern Ocean increase.


Detection and Analysis of Mesoscale
Eddies Based on Deep Learning


Yingjie Liu, Quanan Zheng, and Xiaofeng Li 


1 Introduction 


Mesoscale eddies are circular currents of water with spatial scales from tens to
hundreds of kilometers and temporal scales from days to years [7]. Mesoscale eddies
play a significant role in the transport of momentum, mass, heat, nutrients, salt, and
other seawater chemical elements across the ocean basins, effectively impacting the
global ocean circulation, large-scale water distribution, air-sea coupling, and
biological activities [2, 3, 7, 8, 13, 18]. Mesoscale eddies can be generally classified as
either cyclonic eddies (CEs) if they rotate counterclockwise (in the Northern
Hemisphere) or anticyclonic eddies (AEs) otherwise. CEs (AEs) drive local upwelling
(downwelling), leading to negative (positive) sea surface height (SSH) anomalies
and sea surface temperature (SST) anomalies. The changes in SSH, SST, chlorophyll
concentration (CHL), and roughness caused by oceanic eddies can be recorded
by altimeter, infrared, ocean color, and synthetic aperture radar (SAR) remote
sensing, respectively. Accurate automatic eddy detection is crucial for monitoring the
physical properties, transport, circulation, evolution, and decay of mesoscale eddies
and their impact on other ocean processes. Oceanic eddy detection based
on a variety of remote sensing data has been widely studied.

Automatic eddy identification algorithms developed based on altimeter SSH
data can be divided into three categories: the physical-parameter-based method, which
includes the Okubo-Weiss parameter method [6, 27], the winding angle method [5,
46], and the 2D wavelet method [10]; the flow-direction-based method [39, 48];
and the SSH-based method [7, 18, 38]. Another, more recent method based on
the instantaneous Lagrangian flow geometry [1, 22-25, 40] has been proposed to identify
eddies in turbulent flows. Several eddy detection methods have been developed based on
satellite SST data, e.g., the edge detection method [29], the neural network-based method
[4], the SST contour-based method [17], and the velocity-geometry method [12].
Compared to satellite SSH data, CHL and SAR images have a high spatial resolution,
which makes them effective sources for gaining more comprehensive and detailed
information on mesoscale eddies in the oceans [7, 14, 15, 36]. However, eddy
detection based on CHL and SAR images is still at the case-study stage due to the low
space-time coverage. In summary, existing eddy detection algorithms can detect
the major circular structures of mesoscale eddies, but more work remains to be done. On
the one hand, eddy detection based on each type of remote sensing data has its own
advantages and disadvantages. For instance, eddies may temporarily ‘disappear’ or
go undetected due to noise and sampling errors in the altimeter SSH data, while
eddy detection using SST is prone to false positives because many other ocean
phenomena may impact SST. On the other hand, with the accumulation of remote sensing
data, some algorithms lack computational efficiency due to contour iterations [49]
or complex calculation processes [40].

Recently, deep learning (DL) [33] technology has exhibited state-of-the-art
performance in mining the complicated rules hidden in multi-source ocean remote sensing
images [26, 35, 52]. Moreover, in comparison with traditional statistical and machine
learning methods, DL technology features a strong ability to learn and model
complex relationships [28, 30, 43, 47, 51]. Therefore, it is natural to use a
DL-based model to detect mesoscale eddies from remote sensing images. Lguensat
et al. [34] developed "EddyNet", which is based on the encoder-decoder network U-Net,
to identify oceanic eddies in the southwest Atlantic. Franz et al. [20] also used the
U-Net to detect and track oceanic eddies in Australia and the East Australia Current
regions. Du et al. [15] developed "DeepEddy" based on PCANet and spatial pyramid
pooling to detect oceanic eddies in SAR images. Xu et al. [50] applied
the pyramid scene parsing network to detect eddies in the North Pacific Subtropical
Countercurrent region. These regional studies showed that DL-based models perform
well in detecting mesoscale eddies in regional seas. However, the performance of
DL-based models in global mesoscale eddy detection remains unverified. Moreover,
these works use only one type of remote sensing data as input to detect mesoscale eddies.

To address the above problems, we propose a DL-based global eddy detection
model based on the fusion of SSH and SST data in this study. The remainder of
the study is organized as follows. Section 2 illustrates a DL-based model to
identify global mesoscale eddies based on satellite SSH data. Section 3
presents a multi-modal DL-based eddy detection model developed based on the fusion
of SST and SSH data. Section 4 characterizes the global mesoscale
eddies detected by the multi-modal DL-based model. Finally, Sect. 5 summarizes
the conclusions of our investigation.


2 DL-based Eddy Detection Model Based on SSHA Data


2.1 Data 


The SSHA product is produced by Ssalto/Duacs and distributed by the Archiving,
Validation, and Interpretation of Satellite Oceanographic data (AVISO) service and is
available daily at 0.25° spatial resolution. The product is merged from all available altimeter
missions, including TOPEX/Poseidon (TP), Jason-1&2, European Remote-Sensing
Satellite (ERS)-1&2, Environmental Satellite (ENVISAT), Geosat Follow On (GFO),
Cryosat-2, Saral/Altika, and Haiyang-2A, and covers the period from 1993 to the
present. Since resolving oceanic mesoscale variability requires a minimum of three
altimeter missions [32, 41, 42], only the period from 2000 onward meets the criterion.


2.2 Method 


The DL-based eddy detection model is developed based on the U-Net architecture
with ResNet blocks, hereafter Res-UNet. Although developed initially for
semantic segmentation of biomedical images [38], U-Net [19, 45] has been applied
successfully in many fields. Fig. 1 shows the framework of the U-Net, which
consists of the encoder-decoder module, bottleneck module, and concatenation
module. The encoder module extracts information at different resolutions. The output
module contains a convolutional layer and an activation layer to yield class confidences
at each pixel.

The ResNet block is designed to deepen the network while alleviating the problem
of network degradation. The input to the ResNet block, $x_l$, is processed in two ways.
A 3x3 convolution is used to obtain a direct linear mapping result, i.e., $w_l * x_l$,
where $w_l$ denotes the convolutional filter. Meanwhile, $x_l$ is subjected to the following
processes twice in sequence: batch normalization (BN), a rectified linear unit (ReLU),
and Conv2D. The ReLU layer is used to increase the nonlinearity. By adding the
direct and residual mappings, the ResNet block combines deep and shallow
features, meaning that it can extract more valuable information. The original
information is maintained and passed on by the linear mapping with a 3x3
convolution, which reduces the possibility of degradation.
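
The two paths of the ResNet block can be sketched in plain NumPy as follows. This is an illustrative single-channel sketch rather than the chapter's Keras implementation: the helper names, the simplified normalization (no learned scale or shift), and the random filters are all assumptions made for the example.

```python
import numpy as np


def conv3x3(x, w):
    """'Same'-padded 3x3 convolution (cross-correlation) of a single 2D map."""
    rows, cols = x.shape
    padded = np.pad(x, 1)  # zero padding keeps the output the same size as x
    out = np.zeros(x.shape, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += w[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def batch_norm(x, eps=1e-5):
    """Simplified normalization over one map (assumption: no learned parameters)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def relu(x):
    return np.maximum(x, 0.0)


def res_block(x, w_direct, w1, w2):
    """Direct 3x3 linear mapping plus a residual path of (BN -> ReLU -> Conv) x 2."""
    direct = conv3x3(x, w_direct)
    residual = x
    for w in (w1, w2):
        residual = conv3x3(relu(batch_norm(residual)), w)
    return direct + residual


rng = np.random.default_rng(0)
x = rng.normal(size=(60, 80))            # hypothetical 60 x 80 SSHA patch
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                     # a filter that passes the input through
w1 = rng.normal(scale=0.1, size=(3, 3))
w2 = rng.normal(scale=0.1, size=(3, 3))

y = res_block(x, identity, w1, w2)
y_no_residual = res_block(x, identity, w1, np.zeros((3, 3)))
print(y.shape, np.allclose(y_no_residual, x))
```

When the last residual filter is zero, the block reduces to the direct mapping alone, which is the property that lets the original information pass through unchanged.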


2.3 Experiment and Performance 


Fig. 1 Res-UNet based eddy detection model

The training and validation datasets of mesoscale eddies are generated automatically
by using the SSH-based method [37], which is similar to the eddy identification
method proposed by Chelton et al. [7]. Mesoscale eddies from 2000-2013 and 2014-2015
are used as the training dataset and validation dataset, respectively. There are 5114
training samples and 730 testing samples. Pixels in each sample are labeled as ‘1’, ‘-1’, and
‘0’ inside anticyclonic eddies (AEs), cyclonic eddies (CEs), and background regions,
respectively. The Res-UNet model is trained on an Nvidia GeForce RTX 2070 GPU card using
the ADAM optimizer [31] and mini-batches of 16 maps. An early-stopping strategy is
used to stop the learning process when the validation loss stops improving
for five consecutive epochs. Our model is implemented in Python, with
interfaces based on the Keras framework [9] with a TensorFlow backend.
The dice loss function, which is widely used in segmentation problems, is the cost
function. Given the predicted segmentation P and the ground truth region G, the dice
coefficient is calculated as:

$$\mathrm{Dicecoef}(P, G) = \frac{2|P \cap G|}{|P| + |G|} \tag{1}$$

where $|\cdot|$ is the sum of elements in the area. A good segmentation result is indicated by
a dice coefficient that is close to 1. By contrast, a low dice coefficient (near 0) indicates
poor segmentation performance. A differentiable version of the above metric must be
used to train deep neural networks. A soft Dice coefficient was adopted in this work,
and the output of the softmax layer was used directly in the loss calculation.
The coefficient is given as:

$$\mathrm{softDicecoef}(P, G) = \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i} \tag{2}$$

where $p_i$ is the output of the softmax layer, and $g_i$ is 1 for the correct class and 0 otherwise.
Finally, the loss is calculated as:

$$\mathrm{Loss} = 1 - \mathrm{softDicecoef}(P, G) \tag{3}$$
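
A minimal NumPy sketch of the soft Dice loss is given below. It assumes the common soft Dice form 2 Σ p_i g_i / (Σ p_i + Σ g_i), with p_i the softmax probability for a class and g_i its one-hot ground truth; the small `eps` term that guards against empty masks is an added assumption, and the arrays are toy values.

```python
import numpy as np


def soft_dice_coef(p, g, eps=1e-7):
    """Soft Dice coefficient between probabilities p and a binary mask g."""
    p = np.asarray(p, dtype=float)
    g = np.asarray(g, dtype=float)
    # eps avoids division by zero when both p and g are empty (assumption).
    return (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)


def dice_loss(p, g):
    """Loss = 1 - soft Dice coefficient."""
    return 1.0 - soft_dice_coef(p, g)


# Toy 2 x 2 example for a single class (e.g., the AE class).
g = np.array([[1, 1], [0, 0]])                 # ground-truth mask
p = np.array([[0.8, 0.6], [0.2, 0.0]])         # predicted probabilities
loss = dice_loss(p, g)
perfect = dice_loss(g, g)
print(f"loss = {loss:.4f}, loss for a perfect prediction = {perfect:.4f}")
```

Because the coefficient is computed from probabilities rather than thresholded labels, it stays differentiable, which is what makes it usable as a training loss.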


The loss and accuracy of the Res-UNet model were about 14% and 94%, respectively, when
trained using the ground truth dataset in the South China Sea (SCS) (Fig. 2).

Therefore, the Res-UNet model is accurate and reliable enough to identify mesoscale
eddies in the global ocean. The global SSH and SST maps were first partitioned into
several regional maps of 80x60 pixels. Then, the Res-UNet model was applied to the
SSHA maps of each region at the same time step until all the regions had been processed.
Finally, the eddies of all the regions were seamlessly merged to obtain a global eddy map.
Figure 3 shows the mesoscale eddies identified by the Res-UNet model on January
1, 2019. There are 3314 (2963 ground truth) AEs and 3407 (3056 ground truth) CEs
in the global ocean. Relative to the SSH-based method, the accuracy of the
Res-UNet based global eddy detection method is 93.79%, and the mean IoU is 88.86%.
Figure 3 clearly shows that the Res-UNet model identified many more small-scale
eddies. Besides, it takes less than 1 minute for the Res-UNet model to identify
eddies in the global ocean, while the SSH-based method takes more than 16 hours
[37]. In conclusion, the Res-UNet model can identify many more small-scale eddies
and significantly improve computational efficiency.

Fig. 4 An Argo float (red line) is captured by an AE (blue line) detected by the Dense-UNet
model in the KE region and rotates with the AE on a May 19, 2014, b June 18, 2014, c July 18,
2014, and d September 16, 2014 (the color denotes SSHA)
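
The partition, detect, and merge procedure described above can be sketched as follows. This is a hedged illustration only: the 720 x 1440 grid (a global 0.25° map), the reading of "80x60" as 80 columns by 60 rows, and the `sign_threshold` stand-in used in place of the trained network are all assumptions.

```python
import numpy as np


def split_into_tiles(field, tile_h=60, tile_w=80):
    """Yield (row, col, tile) for non-overlapping tiles covering a 2D field."""
    n_rows, n_cols = field.shape
    for r in range(0, n_rows, tile_h):
        for c in range(0, n_cols, tile_w):
            yield r, c, field[r:r + tile_h, c:c + tile_w]


def detect_global(field, predict_tile):
    """Apply a per-tile detector and merge the label tiles into a global map."""
    merged = np.zeros(field.shape, dtype=int)
    for r, c, tile in split_into_tiles(field):
        labels = predict_tile(tile)
        merged[r:r + tile.shape[0], c:c + tile.shape[1]] = labels
    return merged


def sign_threshold(tile, thresh=0.05):
    """Hypothetical stand-in for the network: 1 (AE), -1 (CE), 0 (background)."""
    return np.where(tile > thresh, 1, np.where(tile < -thresh, -1, 0))


rng = np.random.default_rng(0)
ssha = rng.normal(scale=0.1, size=(720, 1440))   # synthetic global SSHA map (m)
n_tiles = sum(1 for _ in split_into_tiles(ssha))
global_map = detect_global(ssha, sign_threshold)
print(n_tiles, global_map.shape)
```

With a real network, predictions near tile edges depend on context outside the tile, so overlapping tiles or padding may be needed for a seamless merge; the pixel-wise stand-in used here does not have that issue.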

Argo floats operate on short repeating cycles, and they can observe
mesoscale eddies in the global ocean. When trapped in an eddy, they show either a
cyclonic or an anticyclonic trajectory. Therefore, the trajectory data of Argo floats
are utilized to verify the accuracy of the Res-UNet model. In this chapter, the Argo
float (2901556) is used to validate the results of the Res-UNet based eddy detection
model. The Argo float was trapped in the AE and moved in a clockwise loop. Such
a result is consistent with the fact that AEs rotate clockwise in the Northern
Hemisphere (Fig. 4).


3 DL-based Eddy Detection Model Based on SSHA
and SST Data


In order to solve the problem that eddies may temporarily ‘disappear’ or cannot be 
detected due to noise and sampling errors of the SSHA data, SST data that can finely 
delineate the eddy structure are added to the model, to detect mesoscale eddies more 
accurately. 


3.1 Data 


The SST dataset is the NOAA Optimum Interpolation (OI) SST product from
Reynolds et al. [44] at daily and 0.25° resolution. The OISST dataset is constructed
from infrared satellite observations of the Advanced Very High Resolution Radiometer
(AVHRR) with supplemental information provided by in situ observations and
proxy SSTs computed from sea ice concentrations. Error fields are provided, showing
an accuracy of about 0.1 °C on a daily basis. The OISST dataset is available from
1981 onward.


3.2 Method 


The Dense-UNet model comprises a data fusion module and a feature extraction
module (Fig. 5). Considering the complex nonlinear relationship between SST and
SSHA within eddies, a layer-level fusion strategy is used to fuse SSHA and SST
data before the feature extraction. A layer-level fusion network can effectively
integrate and fully leverage multi-modal images. Therefore, the data fusion module
was developed based on the hyper-dense connectivity network [11]. Satellite SST and
SSHA data were fed into two separate streams. To better model the relationships
between SST and SSHA, dense connections, which use linear operations in which every
input is connected to every output by a weight, were introduced into the model. Dense
connections can alleviate the vanishing-gradient problem and reduce the number of
parameters of deep networks [11]. Let $x_l^1$ and $x_l^2$ denote the outputs of the $l$-th layer in
the SST and SSHA streams, and let $H_l^s$ be a mapping function composed of a convolution layer
followed by batch normalization and a ReLU activation function. The output of the
$l$-th layer in a given stream $s$ can then be defined as:

$$x_l^s = H_l^s\left(\left[x_{l-1}^1, x_{l-1}^2, x_{l-2}^1, x_{l-2}^2, \cdots, x_0^1, x_0^2\right]\right) \tag{4}$$

Then, the fused data $x_l^s$ are used as input of the U-Net to detect mesoscale eddies.
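
The hyper-dense connectivity pattern can be sketched in NumPy as follows. This is a simplified illustration, not the chapter's network: the mapping H is reduced to a random 1x1 convolution followed by ReLU (batch normalization omitted), the layer count and channel width are arbitrary, and forming the fused output by concatenating the last layer of both streams is an assumption.

```python
import numpy as np


def h(features, weights):
    """Stand-in for the layer mapping: 1x1 convolution (per-pixel linear map) + ReLU."""
    return np.maximum(np.tensordot(weights, features, axes=([1], [0])), 0.0)


def hyper_dense_fusion(sst, ssha, n_layers=3, growth=4, seed=0):
    """Two streams in which every layer sees all earlier outputs of BOTH streams."""
    rng = np.random.default_rng(seed)
    streams = {1: [sst], 2: [ssha]}          # layer-0 outputs, each (C, H, W)
    in_channels = []
    for _ in range(n_layers):
        depth = len(streams[1])
        # Newest first: [x_{l-1}^1, x_{l-1}^2, ..., x_0^1, x_0^2]
        inputs = [streams[s][k] for k in range(depth - 1, -1, -1) for s in (1, 2)]
        stacked = np.concatenate(inputs, axis=0)
        in_channels.append(stacked.shape[0])
        new = {
            s: h(stacked, rng.normal(scale=0.1, size=(growth, stacked.shape[0])))
            for s in (1, 2)
        }
        for s in (1, 2):
            streams[s].append(new[s])
    fused = np.concatenate([streams[1][-1], streams[2][-1]], axis=0)
    return fused, in_channels


rng = np.random.default_rng(1)
sst = rng.normal(size=(1, 60, 80))           # synthetic single-channel SST patch
ssha = rng.normal(size=(1, 60, 80))          # synthetic single-channel SSHA patch
fused, in_channels = hyper_dense_fusion(sst, ssha)
print(fused.shape, in_channels)
```

The growing input width per layer shows how each layer receives every earlier feature map from both modalities, which is what lets the network model SST and SSHA relationships at all depths.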
Fig. 5 Dense-UNet architecture based on the fusion of SSHA and SST data


3.3 Experiment and Performance 


The training and validation datasets of mesoscale eddies are generated automatically
using the SSH-based method [22]. Mesoscale eddies during 2000-2013 are used as the
training dataset, and mesoscale eddies during 2014-2015 are used as the validation
dataset. There are 5114 training samples and 730 testing samples. Pixels in each
sample are labeled as ‘1’, ‘-1’, and ‘0’ inside AEs, CEs, and background regions,
respectively. To evaluate the performance of the Dense-UNet model, we identify mesoscale
eddies in the Kuroshio Extension (KE) and the SCS. The dice loss function is used as the
cost function. As shown in Table 1, the loss based on the SSHA alone is larger than that
based on the fusion of SSHA and SST. Conversely, the accuracy based on the
SSHA alone is lower than that based on the fusion of SSHA and SST.
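
For reference, the sketch below shows how pixel-wise accuracy and mean IoU (the metric quoted for the global comparison in Sect. 2.3) could be computed from label maps. The chapter does not spell out its metric definitions, so treating accuracy as the fraction of correctly labeled pixels and averaging IoU over the three classes are assumptions; the label maps are toy values.

```python
import numpy as np


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the reference label."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))


def mean_iou(pred, truth, classes=(-1, 0, 1)):
    """Mean intersection-over-union across classes present in pred or truth."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    ious = []
    for c in classes:
        inter = np.sum((pred == c) & (truth == c))
        union = np.sum((pred == c) | (truth == c))
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))


# Toy label maps: 1 = AE, -1 = CE, 0 = background.
truth = np.array([[1, 1, 0], [0, -1, -1]])
pred = np.array([[1, 0, 0], [0, -1, -1]])
acc = pixel_accuracy(pred, truth)
miou = mean_iou(pred, truth)
print(f"accuracy = {acc:.4f}, mean IoU = {miou:.4f}")
```

Mean IoU penalizes a missed eddy pixel more heavily than accuracy does, because background pixels dominate the map and inflate pixel accuracy.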

Table 1 The loss and accuracy of the Dense-UNet model on different testing datasets in different
ocean regions

Region   Dataset          Model        Dice loss   Accuracy
SCS      SSHA             Res-UNet     0.1455      0.9131
SCS      Fusion dataset   Dense-UNet   0.0869      0.9490
KE       SSHA             Res-UNet     0.1637      0.9480
KE       Fusion dataset   Dense-UNet   0.1183      0.9640

The Dense-UNet model can be further verified by a case study of a CE in the
KE (Fig. 6). On November 22 and 23, 2013, the CE identified by SSHA split into
two CEs, while the CE identified by the fusion of SST and SSHA was consistent
with the negative area of SSHA. From December 28, 2013 to January 13, 2014,
the CE identified by SSHA did not cover the negative area of SSHA, while the eddy
boundary identified by the fusion of SST and SSHA completely covered the negative
area of SSHA. Therefore, the fusion of SSHA and SST data
enhances the accuracy and robustness of eddy detection and can also help ensure the
continuity and accuracy of eddy tracking.

Fig. 6 Variations of a CE in the KE at different times during its evolution. The black line represents
the eddy identified by SSHA, while the purple line represents the eddy identified by the fusion of
SST and SSHA

In this section, we propose the Dense-UNet method to identify oceanic mesoscale
eddies. Unlike methods that identify eddies from a single kind of remote
sensing image, Dense-UNet detects eddies based on the fusion of SSHA and SST
data. We perform a comparison experiment using
SSHA data and fusion data in the SCS and KE regions. With the fusion data, the
Dense-UNet model achieves a lower Dice loss and a higher accuracy than the
SSHA-only model in both regions (Table 1). The model not only improves eddy
detection accuracy and efficiency but also
gives a novel viewpoint on exploring the relationships between marine environmental
variables and mesoscale eddies.


4 Characterization Analysis of Mesoscale Eddies in the
Global Ocean


4.1 Spatiotemporal Distributions of Eddies in the Global 
Ocean 


Based on the Dense-UNet model, mesoscale eddies were identified from 23 years of
satellite SSHA and SST data in the global ocean during 1993-2015. This study
focuses on eddies with amplitudes greater than 2 cm and sea surface radii
larger than 35 km, in consideration of the resolution and precision
of the SSH product [16]. Besides, we only consider eddies located in areas where
water depths are greater than 200 m to minimize the impacts of data errors in
shallow coastal waters. An average of 4,100 mesoscale eddies were identified
daily in the global ocean during the period 1993-2015. The frequency of eddies for
a given grid cell (0.25° latitude by 0.25° longitude) was defined as an
F-number for simplicity:

$$F(\%) = \frac{d_{eddy}}{d_{total}} \tag{5}$$

where $d_{eddy}$ is the number of days on which mesoscale eddies appeared, and $d_{total}$ represents
the total number of observation days. In other words, high F-numbers imply a high
intensity of eddy activity and vice versa. The seasonal variability for AEs and CEs
in the global ocean is similar (Fig. 7a-b). In the Southern Hemisphere (SH), eddy
activity is weak in the austral summer (December-February, DJF) and fall (March-May,
MAM), but intense during the austral spring (September-November, SON)
and winter (June-August, JJA), and vice versa in the Northern Hemisphere (NH).
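
The F-number of Eq. 5 can be computed from a stack of daily eddy masks, as in the short sketch below. The masks are toy values on a 2 x 2 grid, and the multiplication by 100 to express the ratio as a percentage is an assumption based on the F(%) notation.

```python
import numpy as np


def f_number(daily_masks):
    """Percentage of observation days on which each grid cell lies inside an eddy."""
    masks = np.asarray(daily_masks, dtype=bool)   # shape: (days, lat, lon)
    d_eddy = masks.sum(axis=0)                    # days with an eddy, per cell
    d_total = masks.shape[0]                      # total observation days
    return 100.0 * d_eddy / d_total


# Four days of hypothetical eddy masks (1 = inside an eddy).
masks = np.array([
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 1]],
])
F = f_number(masks)
print(F)
```

Applying the same function separately to AE and CE masks, or to the days of a single season, gives the per-polarity and seasonal maps discussed in the text.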

Figure 7c-d shows the spatial distribution of mesoscale eddies in the global ocean.
Mesoscale eddies with lower F-numbers were distributed in tropical waters. In
contrast, mesoscale eddies with higher F-numbers were widely distributed in
the mid-latitude regions, including the Kuroshio Extension, the Agulhas
Current, the Gulf Stream, the Agulhas Return Current, the East Australia Current,
and the Antarctic Circumpolar Current. Besides, CE activity is more intense
than AE activity in the western boundary current regions. In general, the spatial
distribution of global eddies detected in this study is in good agreement with the
previous literature [7, 18, 49].


4.2 Long-term Variations in Derived Eddy Parameters 


The long-term variations in annual mean eddy properties (eddy number, radius,
amplitude, and rotational speed) are shown separately for AEs and CEs in the NH and
the SH. The eddy number is the annual mean eddy census per day, and the percentage
represents the ratio of the annual mean abnormal eddy census per day to the total
number of eddies. The eddy radius is the distance from its center to the outermost
SSH contour with the maximum average geostrophic speed (U), $U = \sqrt{u^2 + v^2}$,
where u and v are the zonal and meridional components of the geostrophic velocity
anomaly, which can be computed from the SSH gradients:

$$u = -\frac{g}{f}\frac{\partial SSH}{\partial y}, \quad v = \frac{g}{f}\frac{\partial SSH}{\partial x}$$

where g is the acceleration due to gravity; ∂x and ∂y are the eastward and northward
distances, respectively; and f is the Coriolis parameter. Eddy kinetic energy (EKE) is
given as $EKE = \frac{1}{2}(u^2 + v^2)$. The amplitude is the difference in SSHA between the eddy
core and boundary. The rotational speed is the maximum of the average geostrophic
speed around all of the eddy’s closed SSHA contours.

Fig. 7 Spatiotemporal distribution of the F-number of mesoscale eddies in the global ocean from
1993 to 2015. The graphs and maps show the meridional variation and spatial distribution of the
F-number of AEs (a, c) and CEs (b, d). MAM represents March-May, DJF represents
December-February, JJA represents June-August, and SON represents September-November.
The image resolution is 0.25° by 0.25°
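
The geostrophic relations above can be evaluated on a regular latitude-longitude grid as sketched below, assuming the standard forms u = -(g/f) ∂SSH/∂y and v = (g/f) ∂SSH/∂x. The physical constants, the spherical-Earth conversion from degrees to metres, and the Gaussian SSH bump used as a synthetic anticyclonic eddy are assumptions for illustration; the sketch is not valid near the equator, where f approaches zero.

```python
import numpy as np

G = 9.81             # gravitational acceleration (m s^-2)
OMEGA = 7.2921e-5    # Earth's rotation rate (rad s^-1)
R_EARTH = 6.371e6    # mean Earth radius (m)


def geostrophic(ssh, lat_deg, lon_deg):
    """Return u, v, speed U, and EKE from an SSH field of shape (lat, lon)."""
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    f = 2.0 * OMEGA * np.sin(lat)[:, None]           # Coriolis parameter per row
    # Gradients with respect to angle, then converted to metres.
    dssh_dy = np.gradient(ssh, lat, axis=0) / R_EARTH
    dssh_dx = np.gradient(ssh, lon, axis=1) / (R_EARTH * np.cos(lat)[:, None])
    u = -(G / f) * dssh_dy                           # zonal component
    v = (G / f) * dssh_dx                            # meridional component
    speed = np.hypot(u, v)                           # U = sqrt(u^2 + v^2)
    eke = 0.5 * (u ** 2 + v ** 2)                    # eddy kinetic energy
    return u, v, speed, eke


# Synthetic 0.25-degree grid with a positive SSH bump (an AE) centred at 30N, 145E.
lat_deg = np.linspace(25.0, 35.0, 41)
lon_deg = np.linspace(140.0, 150.0, 41)
lon2d, lat2d = np.meshgrid(lon_deg, lat_deg)
ssh = 0.2 * np.exp(-((lat2d - 30.0) ** 2 + (lon2d - 145.0) ** 2) / 2.0)

u, v, speed, eke = geostrophic(ssh, lat_deg, lon_deg)
print(f"max speed = {speed.max():.3f} m/s, max EKE = {eke.max():.4f} m^2/s^2")
```

In this Northern Hemisphere example the flow is eastward on the northern flank of the bump and southward on its eastern flank, i.e., clockwise, as expected for an anticyclonic eddy.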

About 2100 CEs and 2000 AEs per day were detected by the Dense-UNet
eddy detection model in each global SSHA map. This is close to the result of
Faghmous et al. [18], who identified approximately 2300 CEs and 2300 AEs in
each daily SSHA snapshot. The slight difference in eddy number between the two
eddy datasets possibly arises because there is no limit on the amplitude of eddies in
Faghmous et al. [18]. There were no significant decreasing or increasing trends
in the annual mean eddy parameters for either AEs or CEs during the 1993-2015
period, and the annual mean eddy parameters for eddies in the NH and the SH are
different (Fig. 8). Eddy numbers in the SH are twice those in the NH, which
is consistent with the result in Fig. 7a-b. The annual mean radius of AEs is slightly
larger than that of CEs (about 87.0 km and 86.0 km, respectively). The annual
mean amplitude of the CEs is larger than that of the AEs in both hemispheres, and the
annual mean eddy amplitude in the SH is larger than that in the NH. The annual mean
eddy amplitude of AEs and CEs in the NH (SH) is 6.14 (6.7) cm and 6.29 (7.37) cm,
respectively. The difference in amplitude between AEs and CEs is expected from
the gradient wind effect of the centrifugal force that pushes fluid outward in rotating
eddies [21], thus intensifying the low pressure at the centers of CEs and weakening
the high pressure at the centers of AEs [7].

Similarly, the annual mean eddy rotational speed and EKE of CEs are also larger
than those of AEs in their respective hemispheres since they were derived from eddy
amplitude. However, the annual mean rotational speed and EKE of eddies in the NH
are larger than in the SH. The annual mean eddy rotational speed of AEs and CEs in
the NH (SH) is 19.73 (18.19) cm/s and 20.91 (19.84) cm/s, respectively. The annual
mean EKE of AEs and CEs in the NH (SH) is 152.95 (113.41) cm²/s² and 190.00
(145.08) cm²/s², respectively.

Fig. 8 Variations in annual mean parameters of mesoscale eddies in the global ocean from 1993
to 2015. Eddy number a, eddy radius b, eddy amplitude c, rotational speed d, and EKE e. μ
(dotted line) and σ (shading) are the mean values and one standard deviation of the annual mean eddy
parameters


5 Conclusions 


This chapter elaborated on how to apply deep learning technology to global mesoscale
eddy detection. We first developed a deep learning-based eddy detection model based
on SSHA data. The model, called Res-UNet, consists of U-Net and ResNet blocks.
The Res-UNet was applied to detect mesoscale eddies in the global ocean. Argo
float data were used to verify the Res-UNet model. The Argo float was trapped in
the AE and moved in a clockwise loop, which is consistent with the fact
that AEs rotate clockwise in the Northern Hemisphere. Compared to the traditional
eddy detection methods, the Res-UNet eddy detection model can accurately identify
mesoscale eddies and significantly improve computational efficiency. Such a result
indicates that deep learning technology has strong learning abilities and can make
better use of datasets for feature extraction.

Considering that eddies may temporarily ‘disappear’ or go undetected due to
noise and sampling errors in the SSHA data, the study further develops a multi-modal
deep learning model, the Dense-UNet model, to detect mesoscale eddies based on the
fusion of SSHA and SST data. The Dense-UNet model extracts SSHA information
for determining eddy locations and extracts SST information to supplement and
confirm eddy features embodied in SSHA data. The results show that the fusion of
SSHA and SST data enhances the accuracy and robustness of eddy detection and
can also help ensure the continuity and accuracy of eddy tracking. Based on the
Dense-UNet eddy detection model, mesoscale eddies are detected from satellite SSHA and
SST data in the global ocean from 1993 to 2015. The analysis of the spatiotemporal
distribution of the 23-year global eddy dataset revealed that eddies were
concentrated along western boundary currents. Mesoscale eddies are active in winter in the
Northern Hemisphere and vice versa in the Southern Hemisphere. The spatiotemporal
distribution of eddies detected by the Dense-UNet model is in good agreement with
previous studies, thus further validating the model’s accuracy.

The long-term variations in annual mean eddy properties (eddy number, radius,
amplitude, and rotational speed) are analyzed separately for AEs and CEs in the
Northern and Southern Hemispheres. There were no significant decreasing or
increasing trends in the annual mean eddy parameters for either AEs or CEs during
the 1993-2015 period, but the annual mean eddy parameters for eddies in the
Northern Hemisphere and the Southern Hemisphere are different. Eddy numbers in
the Southern Hemisphere are twice those in the Northern Hemisphere.
The annual mean radius of AEs is slightly larger than that of CEs in both
hemispheres. The annual mean amplitude of the CEs is larger than that of the AEs in both
hemispheres, and the annual mean eddy amplitude in the Southern Hemisphere is
larger than that in the Northern Hemisphere. The annual mean eddy rotational speed
and EKE of CEs are also larger than those of AEs in their respective hemispheres. However,
the annual mean rotational speed and EKE of eddies in the Northern Hemisphere
are larger than those in the Southern Hemisphere. The difference in eddy parameters
between the two hemispheres is caused by the different generation mechanisms of
mesoscale eddies, which deserve further study. In conclusion, the study extends the
usage of satellite remote sensing big data, enriches the application of deep learning
technology in oceanography, and promotes multidisciplinary research in this aspect.


1 Introduction


Flooding is a severe natural disaster. It can be caused by various reasons. In the 
coastal areas, tropical cyclone-induced coastal flooding is the combined effect of 
storm surge-caused sea water inundation and rainfall-induced freshwater flooding. 
If tropical cyclone-induced flooding occurs at the same time as the rainy season, the 
consequences will be even more serious. If flooding occurs in locations with dense 
populations and large cities, it will result in huge loss of life and property [36]. For 
example, on August 26-28, 2017, Harvey lingered over the Houston area, a densely 
populated place, dumping massive amounts of rain. There were over 80 fatalities as 
a result of the extraordinary flooding [33]. Harvey produced over 125 billion dollars 
in damage, according to the National Hurricane Center. 

Coastal flooding may become considerably severe in the future as a result of 
climate change and anthropogenic activities. For starters, greater temperatures may 
lead to more moisture in the atmosphere, enhancing the intensity of the flood [34]. 
Climate warming had increased the average and extreme rainfall of storms Katrina, 


B. Liu
College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, China

B. Liu · G. Zheng
State Key Laboratory of Satellite Ocean Environment Dynamics, Second Institute of
Oceanography, Ministry of Natural Resources, Hangzhou 310012, Zhejiang, China

B. Liu
Key Laboratory of Marine Ecological Monitoring and Restoration Technologies,
Ministry of Natural Resources, Shanghai 200137, China

X. Li (✉)
CAS Key Laboratory of Ocean Circulation and Waves, Institute of Oceanology,
Chinese Academy of Sciences, Qingdao 266071, China
e-mail: lixf@qdio.ac.cn

© The Author(s) 2023
X. Li and F. Wang (eds.), Artificial Intelligence Oceanography,
https://doi.org/10.1007/978-981-19-6375-9_11




Irma, and Maria, according to Patricola and Wehner’s simulations [23]. According to
Bilskie et al. [2], human activities can exacerbate the impact of coastal inundation
on infrastructure. Another study [38] found that urbanization worsened both the flood
response and the total rainfall from hurricanes. These studies should raise our
awareness of increased flooding in highly urbanized and densely populated coastal
areas, in both developed and developing countries.

Accurate flood mapping can help emergency managers create more focused disaster
response strategies, and it can help researchers better understand flooding dynamics
and develop more accurate forecasting methods. Floods can be mapped through ground
surveys or through information retrieval from remote sensing imagery. Ground surveys
are direct and exact, but they are expensive, and certain regions are inaccessible to
humans after flooding. Flood mapping from remote sensing data is low in cost and can
cover areas that humans cannot access. The first remote sensing data source is
optical data. Optical images are easy for humans to interpret and use, but optical
sensors do not work at night and cannot see through cloud, which limits the
applicability of optical remote sensing for information extraction during flooding.
The second data source is radar remote sensing, especially synthetic aperture radar
(SAR), which provides high-resolution images. SAR is a useful remote sensing tool for
flood mapping since it can image floods at any time of day or night and in almost any
weather condition. This ability is especially useful for mapping dynamic flooding to
understand flooding mechanisms and to support disaster relief planning.

Traditional flood mapping techniques using SAR data rely on image processing
techniques that use backscattering, statistical, and polarimetric information.
These methods include histogram thresholding [3], active contour segmentation
[13], region growing [21], change detection [9, 20], statistical classification [10],
neuro-fuzzy classification [6], multi-temporal statistics [4], pixel-based supervised
classification [35], and object-oriented rule-based classification [25]. Although
traditional methods have achieved good results in some cases, and some are even used
in practical applications, they mine multi-dimensional SAR data using human-crafted
features and rules. It is difficult for human-crafted features and rules to guarantee
stable performance under a variety of influences, including: (1) speckle; (2) temporal
mis-registration; (3) imaging system parameters [22]; (4) meteorological factors
[12, 18]; and (5) environmental conditions.

Deep learning (DL) technology, particularly deep convolutional neural network 
(DCNN) models, offers a promising route for reliable flood mapping. Instead of 
being pre-defined, the features for reliable flood classification in the DCNN models 
are mined from the multi-dimensional SAR data directly. These data-driven models 
are capable of offering reliable characteristics under a variety of influencing condi- 
tions, and they are optimized from data to information in an end-to-end style. This 
concept has been proven in a variety of communities, including computer vision [29], 
biomedical image processing [7] and geoscience [15, 26, 39]. DCNN-based methods 
for flood mapping have been proposed recently. Kang et al. [14] demonstrated that a 
fully convolutional network, which is one type of DCNN model, can produce more 
precise flooding mapping than previous approaches. Rudner et al. [28] presented a 
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DCNN-based method for retrieving flooded built-up areas that shows promise. We 
[18] presented a modified DCNN method for coastal flooding mapping from
multi-temporal dual-polarimetric SAR data that offers reliable results and is
suitable for spatial and temporal investigation of storm-caused coastal flooding. For
flooding mapping in built-up areas from high-resolution SAR imagery, Li et al. [16]
presented an active self-learning DCNN model.

Based on our past research [18, 19], we believe that DCNN models can overcome the
difficulties of robust flooding mapping. This chapter describes the DCNN-based SAR
coastal flooding mapping network (SARCFMNet), a model designed specifically for
coastal flooding mapping. It has two improvements that increase accuracy and
robustness: (1) the physics-aware input information design fuses temporal and
polarimetric information for more reliable mapping and integrates the radar remote
sensing mechanisms of flooding extraction into the DCNN; (2) a regularization scheme
suited to fully convolutional networks enhances the model’s reliability. SARCFMNet
was trained and tested using a dataset of coastal flooding in Houston, Texas, induced
by Hurricane Harvey in 2017. The flooded regions, which cover around 4000 km², are
delineated and studied in these images. The contributions of this study are listed as
follows:


Compared with the commonly used benchmark DCNN approach, SARCFMNet performs
better and is more stable. This demonstrates that the physics-aware input
information design and the regularization scheme can improve performance and
reliability.

The spatial and multi-temporal characteristics of the Harvey-caused inundation
are investigated using the mapping results.

The wind influence is revealed, implying that DCNN models that consider wind
impact could be more reliable in practice.

Cost-sensitive losses for DCNN models are investigated, which might be beneficial
for more adaptive models that take performance costs into account.

The trained and tested SARCFMNet model is applied to Bangladesh, one of the
United Nations (UN)-defined least developed countries, to obtain nation-level,
multi-year, high-temporal-resolution flooding maps. This can help us gain a deeper
understanding of the flooding mechanism in this country.


The chapter is organized as follows. Section 2 introduces the dataset used
for model training and testing. The model is described in Sect. 3. In Sect. 4,
the model performance is presented. In Sect. 5, the multi-year flooding maps of
Bangladesh are given and discussed. Section 6 concludes this chapter.
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2 Dataset 


2.1 Data Description 


The dataset used for the training and testing of the SARCFMNet model is collected
from Sentinel-1 SAR data acquired during Hurricane Harvey. Around the end of August
2017, Hurricane Harvey caused damage in the Houston region. Six pairs of Sentinel-1
SAR images of the study area were obtained during this time. The images have VH and
VV polarizations. One pair is from the Stripmap (SM) mode, while five pairs are from
the Interferometric Wide (IW) swath mode. The Ground Range Detected products are
used. Table 1 lists the data parameters in the dataset. The post-event image of the
IW01 pair is impacted by strong wind. Harvey had weakened to a tropical storm by the
time this image was taken, but it still delivered powerful winds to the scene, with
speeds of around 20 m s⁻¹ [31]. We labeled the flooded regions as the ground truth
using land-cover categories from Google Earth and OpenStreetMap and the Copernicus
Emergency Management Service Rapid Mapping products [5].

Figure 1 gives a visual illustration of one of the pairs that make up the dataset,
the SM01 pair. In this figure, the first and second rows show the VV and VH images.
In these two rows, the first and second columns show the pre- and post-event images,
respectively. The OpenStreetMap and Google Earth images of the region are in the
third row. The SM01 pair covers Houston’s western and southern areas.


Table 1 Descriptions of the image pairs for generating the dataset in this study 


Pre-event time (MM- Post-event time (MM- Coverage (Up left | 
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Fig. 1 Illustration of one image pair constructing the dataset 
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Fig. 2 Flowchart of data preprocessing 


2.2 Data Preparation 


The original SAR images are processed in the following steps, as illustrated in Fig. 2, 
to construct the dataset for model training and testing. 


1. Application of orbit file: The precise satellite orbit files are applied to the SAR
products.

2. Filtering with sliding windows: To lessen the impact of speckle on the SAR images,
a sliding-window filter is applied.

3. Radiometric calibration: After this calibration, the pixel values of the SAR images
represent the backscattering information ($\sigma^0$).

4. Conversion to dB: The linear-scale $\sigma^0$ is converted to decibels
($\sigma^0_{\mathrm{dB}}$). The normalized radar cross section (NRCS) images in dB
are generated.

5. Terrain correction: The SAR images representing the $\sigma^0_{\mathrm{dB}}$
information are geocoded into a geographical coordinate system, the World Geodetic
System 84. The ocean is masked out with the digital elevation model (DEM)
information. After this step, each pixel occupies $8.9832 \times 10^{-5}$ degrees.

6. Subset generation: The pre- and post-event images are transformed into the same
coordinate system for each pair of data used to create the dataset. We trimmed
the subsets from the pre- and post-event images to the same coverage.
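
The conversion to dB in step 4 can be illustrated with a minimal NumPy sketch. It is not the processing chain used by the authors; the function name `linear_to_db` and the clipping floor are assumptions introduced here so that the logarithm never receives zero or negative values.

```python
import numpy as np


def linear_to_db(sigma0_linear, floor=1e-6):
    """Convert linear-scale backscatter to decibels: 10 * log10(sigma0).

    `floor` is an assumed lower bound (here -60 dB) that guards against
    zero or negative pixel values; it is not taken from the chapter.
    """
    sigma0 = np.asarray(sigma0_linear, dtype=np.float64)
    return 10.0 * np.log10(np.clip(sigma0, floor, None))


# A difference of two dB images equals the log-ratio of the linear images,
# which is why temporal differences in dB carry change information.
pre_linear = np.array([0.10, 0.05])
post_linear = np.array([0.01, 0.05])
difference_db = linear_to_db(post_linear) - linear_to_db(pre_linear)
```

In this toy case the first pixel darkens by a factor of ten, so its dB difference is about -10, while the unchanged second pixel gives a difference of 0.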


The pre- and post-event images are geometrically matched after the preprocessing.
We cut each pair into non-overlapping 256 × 256 samples that carry multiple pre- and
post-event channels. The sample numbers for all the pairs are shown in Table 1.
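
As an illustration of the tiling described above, the following sketch cuts a co-registered image pair into non-overlapping square samples. The function name `tile_pair`, the channel-last layout, and the choice to discard incomplete tiles at the right and bottom edges are assumptions made for this example.

```python
import numpy as np


def tile_pair(pre, post, size=256):
    """Cut a co-registered pre/post pair into non-overlapping tiles.

    pre, post: arrays of shape (H, W, C) on the same grid.
    Returns an array of shape (n_tiles, size, size, 2 * C).
    Edge pixels that do not fill a whole tile are dropped (an assumption).
    """
    if pre.shape != post.shape:
        raise ValueError("pre- and post-event images must share one grid")
    stacked = np.concatenate([pre, post], axis=-1)
    height, width, channels = stacked.shape
    rows, cols = height // size, width // size
    cropped = stacked[: rows * size, : cols * size]
    # Split both spatial axes into (tile index, offset inside tile).
    tiles = cropped.reshape(rows, size, cols, size, channels).swapaxes(1, 2)
    return tiles.reshape(rows * cols, size, size, channels)
```

Tiles are ordered row by row, so tile `k` covers tile-row `k // cols` and tile-column `k % cols` of the cropped scene.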


3 Model 


The SARCFMNet model is tailored from the U-Net model [27], chosen for its
verified effectiveness. The U-Net model was proposed for biomedical image
segmentation. Its architecture was designed to work with fewer training
samples while still producing precise segmentations. The proposed SARCFMNet
is shown in Fig. 3a. The network can be divided into two paths. The left path is an
encoding path that extracts abstract features for accurate classification, with
down-sampling stage by stage. The right path is a decoding path that up-samples the
feature maps. Skip connections run from the encoding path to the decoding path to
provide the latter with high-resolution features via concatenation. As illustrated in
Fig. 3a, the encoding phase consists of 3 × 3 convolutions activated by the rectified linear
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Encoder module: it is composed of convolutional layers and pooling layers, used for information extraction, and 
repeated several times.

Decoder module: it is composed of upsampling/transposed-convolutional layers and convolutional layers, used for
resolution restoration, and repeated several times.

Bottleneck module: it is composed of convolutional layers.

Output module: it is composed of convolutional layers, regularization layers, and pixel-wise classification layers.

Concatenation module: higher resolution image information is concatenated into lower resolution image
information. Multi-scale information is fused.


Fig. 3 The proposed model design. a The proposed SARCFMNet model structure. b The abstracted
model design, which can be generalized to multiple ocean remote sensing image information mining
problems
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unit (ReLU) and 2 x 2 max-pooling. 3 x 3 convolutions with ReLU activation and 
2 x 2 up-sampling operations constitute the decoding part. The output layer of the 
model is a sigmoid-activated 1 × 1 convolution. Thus, the model predicts the
probability that each pixel is flooded. The loss function of the model is the binary
cross-entropy (BCE) loss [17]. Pixel-wise classification accuracy is used as the metric
to evaluate model performance.
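
To make the loss and the metric concrete, the sketch below evaluates the BCE loss and the pixel-wise accuracy for a flood mask and a map of predicted probabilities. It is written in plain NumPy rather than the Keras framework used in this study; the function names, the clipping constant `eps`, and the 0.5 decision threshold are assumptions for illustration.

```python
import numpy as np


def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and probabilities."""
    y = np.asarray(y_true, dtype=np.float64)
    # Clip probabilities so that log() stays finite (assumed safeguard).
    p = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def pixel_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of pixels whose thresholded prediction matches the label."""
    predicted_flood = np.asarray(y_prob) >= threshold
    labelled_flood = np.asarray(y_true) >= 0.5
    return float(np.mean(predicted_flood == labelled_flood))
```

A prediction of 0.5 everywhere gives a loss of ln 2 (about 0.693), which is a useful reference point when reading training curves.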

In the SARCFMNet model, there are two specially-tailored modifications designed 
for coastal flooding mapping. 


1. Physics-aware input information design: Guided by the problem, we design
different input information combinations, as shown in Fig. 3a. For $\sigma$, the
superscript indicates pre- or post-event information, and the subscript indicates
the polarization. Bi-temporal information, with pre- and post-event information from
a single polarization, is often used in flooding mapping [3, 10, 14]. This is a
direct design. In this study, based on radar remote sensing physics, we propose that
the VV and VH polarization information should be fused, since the two polarizations
complement each other. In addition, we propose that the temporal difference images
should also be used. From Sect. 2.2, we know that the preprocessed images represent
the backscattering information in log scale. Therefore, the temporal difference
images $\sigma^{\mathrm{post}}_{VV} - \sigma^{\mathrm{pre}}_{VV}$ and
$\sigma^{\mathrm{post}}_{VH} - \sigma^{\mathrm{pre}}_{VH}$ represent the log-ratio
information for VV and VH, respectively. From previous studies [1], we know that the
log-ratio is useful for SAR image change detection. Based on radar remote sensing
physics, the SARCFMNet model fuses temporal, log-ratio, and polarization information
together, denoted as DUAL+Diff. This approach can increase the accuracy and
reliability of the DCNN model, making it appropriate for coastal flooding mapping
from SAR remote sensing data. The fused input information sources are integrated as
a data cube. This design achieves information fusion with only a small increase in
the number of parameters.

2. DCNN-suitable regularization design: For DCNN models such as the proposed
SARCFMNet, the ability to generalize is limited by overfitting. When a model
overfits, it may produce excellent results during the training phase but poor
results when used in practice. Dropout is a suitable scheme for avoiding overfitting
in fully connected networks, although it is less helpful for convolutional layers
[8]. The network design contains no fully connected layers; it is a fully
convolutional model. In this case, we should use a dropout scheme that is effective
for convolutional layers. Here, inspired by Tompson et al. [30], we include the
SpatialDropout2D (SD2D) layer, which applies channel-level dropout, to regularize
the model and increase its generalization ability.
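
The two modifications can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the six-channel ordering of the data cube, the post-minus-pre sign of the difference images, and both function names are hypothetical, and the dropout function only mimics the training-time behaviour of a channel-level dropout layer (whole feature maps are zeroed and the survivors are rescaled).

```python
import numpy as np


def build_dual_diff_cube(vv_pre, vv_post, vh_pre, vh_post):
    """Stack dB images of shape (H, W) into an (H, W, 6) input cube.

    Because the inputs are in dB, each difference image is the log-ratio
    of the corresponding linear-scale images.
    """
    diff_vv = vv_post - vv_pre
    diff_vh = vh_post - vh_pre
    return np.stack(
        [vv_pre, vv_post, diff_vv, vh_pre, vh_post, diff_vh], axis=-1
    )


def spatial_dropout_2d(features, rate, rng):
    """Drop entire channels of an (H, W, C) feature map with probability `rate`.

    Kept channels are scaled by 1 / (1 - rate) so that the expected
    activation is unchanged (inverted dropout). Requires rate below 1.
    """
    keep = rng.random(features.shape[-1]) >= rate
    return features * keep / (1.0 - rate)
```

Dropping whole channels matters for convolutional features because neighbouring pixels in one feature map are strongly correlated, so removing isolated pixels barely perturbs the signal.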


The model can be generalized to multiple problems. It can be abstracted
as the design in Fig. 3b. There are five modules: (1) module 1 for encoding; (2) module 2
for decoding; (3) module 3 for generating high-level bottleneck features; (4)
module 4 for outputting predictions with adaptive processes; (5) module 5 for fusing
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feature information between skip connections. With suitable information input and 
specially-tailored modifications, this abstracted model can realize multi-task pixel- 
level ocean remote sensing image information mining [15]. 


4 Performance Evaluations and Discussions 


4.1 Performance Evaluations 


The six image pairs yield 10049 samples, as shown in Table 1. We randomly choose
roughly 20% of the samples in each pair to create a sub-dataset with 2000 samples,
denoted as the S2000 dataset, which is used for model training. During training, 70%
of these samples (1400) are randomly selected for training, and the other 30% (600)
are used for validation. The SpatialDropout2D layer has one hyperparameter, the
dropout rate, which we set to 0.5. Thus, results from the model with
SpatialDropout2D are identified by the suffix _SD2D0.5. Model training and testing
are implemented with the Keras software framework. The optimizer is Adam, the batch
size is 32, and the total number of epochs is 300. The validation set is used to
select the model parameters. We use one Nvidia GeForce GTX 1080Ti graphics processing
unit (GPU) card. The training time on the S2000 dataset is about 6.7 hours.
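
The 70/30 split of the S2000 dataset can be reproduced in spirit with a few lines of NumPy. The function name `split_train_val` and the fixed random seed are assumptions; the chapter does not state how the random selection was seeded.

```python
import numpy as np


def split_train_val(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


# For the 2000 samples of S2000 this yields 1400 training indices
# and 600 validation indices.
train_idx, val_idx = split_train_val(2000)
```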

The losses and accuracies for the training and validation are documented and 
analyzed. The readers can find the details in [18]. The conclusions drawn from the 
analyses are listed here. 


1. In all the settings, the models are fully trained. Using the validation loss as
an indicator, overfitting is avoided as far as possible.

2. The usefulness of the log-ratio information: The performance comparison shows
that including the log-ratio information improves the model’s performance
for coastal inundation mapping.

3. The usefulness of the dual-polarization fusion: The performance comparison shows
that fusing the polarization information improves the model’s performance for
coastal inundation mapping. In addition, the VH polarization achieves better
performance than the VV polarization. A possible reason is that VH is less
sensitive to the wind conditions during flooding mapping. We will discuss
this later.

4. The usefulness of the DCNN-suitable regularization design: With the
regularization layer suited to the fully convolutional model, the performance
decreases in the training process but increases in the validation process. This
indicates that the regularization design makes the model more reliable.


On the dataset created in Sect. 2.2, the SARCFMNet trained on the S2000 dataset 
is applied. The results are given in Table 2. The input data and regularization scheme 
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Table 2 The trained SARCFMNet model’s performance on the dataset 
Subset VV+Diff VH+Diff DUAL+Diff DUAL+Diff_SD2D0.5
(each block lists, from top to bottom: accuracy, recall, precision, F1 score)


0.9677 0.9651 0.9754 0.9810 
vad 0.8045 0.7226 0.8388 0.8849 
0.8922 0.9489 0.9313 0.9393 
0.8461 0.8204 0.8826 0.9113 
0.9861 0.9897 0.9916 0.9907 
RO 0.8947 0.7831 0.9087 0.9172 
0.7970 0.9647 0.8932 0.8678 
0.8430 0.8645 0.9009 0.8918 
0.9350 0.9685 0.9684 0.9741 
IW03 0.6395 0.8996 0.9184 0.9541 
0.6507 0.7889 0.7793 0.8025 
0.6451 0.8406 0.8432 0.8718 
0.9817 0.9932 0.9904 0.9912 
EU 0.9143 0.8529 0.9301 0.8721 
0.6593 0.9324 0.8066 0.8598 
0.7661 0.8909 0.8640 0.8659 
0.9647 0.9719 0.9728 0.9824 
T 0.8888 0.9151 0.9641 0.9397 
0.6193 0.6763 0.6718 0.7790 
0.7300 0.7778 0.7918 0.8518 
0.9866 0.9823 0.9912 0.9915 
Sao 0.8580 0.6750 0.8816 0.8909 
0.8867 0.9907 0.9492 0.9467 


0.8029 
0.9681 0.9771 0.9800 

0.8306 0.8219 0.9098 0.9118 
0.7363 0.8649 0.8230 0.8575 
0.7737 0.8313 0.8583 0.8815 


Weighted average by sample-number proportion




are indicated by the column names, and the subset names are indicated by the row
names. There are four numbers in each block of the table: classification accuracy,
recall, precision, and F1 score, in that order. Recall is the ratio of true positives
to the sum of true positives and false negatives. A higher recall indicates that the
model misses fewer areas that are actually flooded. Precision is the ratio of true
positives to the sum of true positives and false positives. A higher precision
indicates that the model is less likely to report flooding where there is none. The
F1 score is the harmonic mean of precision and recall, and it is pulled toward the
lower of the two. The weighted average is shown in the last row, with weights
determined by the number of samples in each subset. The best accuracy and F1 score
are underlined. In the table, the block with the best accuracy also has the best F1
score. From this table, we can draw conclusions consistent with those listed above:
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Before Hurricane Harvey VV 2017-08-05T00:26UTC After Hurricane Harvey VV 2017-08-29T00:26UTC 


Fig. 4 Visual evaluation on the IW01 subset. a and b are the pre- and post-event NRCS images for
the VV. c and d are the pre- and post-event NRCS images for the VH. e Ground truth. f Mapping
prediction with the model (DUAL+Diff_SD2D0.5)


1. The fusion of dual-polarization information improves the model’s coastal inun- 
dation mapping performance. 

2. The model gets better performance on VH polarization than VV polarization. 

3. The DCNN-suitable regularization improves performance and robustness of the 
model. 
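
The four metrics reported in Table 2 follow directly from the pixel-wise confusion counts. The sketch below is a plain NumPy illustration; the function names and the convention of returning 0 when a denominator is zero are assumptions.

```python
import numpy as np


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def flood_metrics(y_true, y_pred):
    """Return (accuracy, recall, precision, F1) for binary flood masks."""
    truth = np.asarray(y_true).astype(bool).ravel()
    pred = np.asarray(y_pred).astype(bool).ravel()
    tp = int(np.sum(truth & pred))    # flooded pixels correctly found
    fp = int(np.sum(~truth & pred))   # false alarms
    fn = int(np.sum(truth & ~pred))   # missed flooded pixels
    tn = int(np.sum(~truth & ~pred))  # dry pixels correctly left dry
    accuracy = (tp + tn) / truth.size
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision, f1_score(precision, recall)
```

As a consistency check, the first block of Table 2 reports a recall of 0.8045 and a precision of 0.8922 for VV+Diff; `f1_score(0.8922, 0.8045)` rounds to 0.8461, which matches the F1 score listed in that block.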


The visual evaluation on the IW01 subset, used as an example for presentation,
is shown in Fig. 4. In this figure, the first and second rows show the VV and VH
images. In these two rows, the first and second columns show the pre- and
post-event images, respectively. The ground truth and the flooding prediction of the
model (DUAL+Diff_SD2D0.5) are shown in Fig. 4e, f. By comparing Fig. 4e, f, we
can observe that the mapping prediction is very close to the ground truth, indicating
that the presented model is effective.
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4.2 Spatial and Temporal Characteristics 


After we apply the trained SARCFMNet model on the dataset described in Sect. 2, we 
can analyze the spatial and temporal characteristics of 2017 Harvey-induced coastal 
inundation. 

The image pair SM01 is selected as a case for the geospatial analysis. The predicted
flooding mapping is shown in Fig. 5a. To perform the geospatial analysis, we
collect supporting data, which are shown in Fig. 6. The supporting data include:
(1) the elevation data of the scene, from the United States Geological Survey (USGS)
National Elevation Dataset [32], shown in Fig. 6a; (2) the land cover types of
the scene, from the 2016 National Land Cover Database (NLCD) [37], shown
in Fig. 6b with legend; (3) the historical water occurrence data of the scene, from the
Global Surface Water Mapping Dataset (1984–2015) [24], shown in Fig. 6c.

We can derive certain geospatial analytic findings with the mapping predictions 
and supporting data: 


1. General analysis: In this scene, the total flooding area is about 284 km² (about
3% of the scene). We use a disk-shaped average filter (radius = 100 pixels) to
process the flooding map and create a flooding heat map for the scene. In Fig. 5b,
the heat map is overlaid on the pre-event image. As the heat map shows, severely
flooded regions are densely distributed in the southern part of Houston.

2. Relation with elevation: The elevation distribution of the flooded and non-
flooded areas in the scene is analyzed and shown in Fig. 7a. It shows the elevation
distribution of the flooded areas is different from that of the non-flooded areas, and


Fig. 5 Subset SM01 for the geospatial analysis. a Coastal inundation prediction from the
SARCFMNet model. b Inundation heat map generated from the prediction

Fig. 6 Supporting data for the spatial analysis. a The elevation data of the scene, from the United
States Geological Survey (USGS) National Elevation Dataset; b The land cover types of the scene,
from the 2016 National Land Cover Database (NLCD); c The historical water occurrence data of
the scene, from the Global Surface Water Mapping Dataset (1984–2015)


the former is clearly lower. Flooding is more likely to occur and persist in
lower-lying regions, which in this scene are in the southern part.

3. Relation with land cover types: The proportion of land cover types affected by
the flooding is illustrated in Fig. 7b. It demonstrates that pasture and cultivated
crops are the dominant land cover types in flooded regions, accounting for more
than 76% of the flooding. They are the main land cover types in the southern part,
which is severely flooded. The flooding may severely damage local agriculture.
However, we have to recognize that even if the hurricane caused severe flooding
in urban areas, inner-city flooding cannot be easily extracted by purely
image-based analysis.

Fig. 7 Geospatial analysis for the SM01 subset. a The elevation distribution of the flooded and
non-flooded areas. b The proportion of land cover types affected by the flooding. c The historical
water occurrence of the flooded areas


4. Relation with historical water occurrence: The flooding is extracted by
analyzing the pre- and post-event images. We have to be sure that the flooding is not
caused by seasonal or periodic increases in surface water. For the flooded areas,
the historical water occurrence is analyzed and shown in Fig. 7c. It reveals that,
in flooded areas, the historical water occurrence is extremely close to zero. This
signifies that the predicted flooding is abnormal, and people should be alert to it.
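
The heat map mentioned in item 1 of the list above averages the binary flooding map inside a disk around every pixel. The sketch below shows one way to do this with NumPy only. It is an illustration, not the authors' code: the function names are hypothetical, the convolution is carried out with FFTs, and pixels outside the scene are treated as non-flooded (zero padding), which is an assumption.

```python
import numpy as np


def disk_kernel(radius):
    """Normalized disk-shaped averaging kernel of the given pixel radius."""
    y, x = np.ogrid[-radius : radius + 1, -radius : radius + 1]
    kernel = (x * x + y * y <= radius * radius).astype(np.float64)
    return kernel / kernel.sum()


def flood_heat_map(flood_mask, radius=100):
    """Local fraction of flooded pixels within `radius` of each pixel."""
    mask = np.asarray(flood_mask, dtype=np.float64)
    kernel = disk_kernel(radius)
    height, width = mask.shape
    # Full linear convolution via zero-padded FFTs.
    shape = (height + kernel.shape[0] - 1, width + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(mask, s=shape) * np.fft.rfft2(kernel, s=shape)
    full = np.fft.irfft2(spectrum, s=shape)
    # Crop the central part so that the output aligns with the input grid.
    return full[radius : radius + height, radius : radius + width]
```

Each output value lies between 0 and 1 (up to floating-point error) and can be read as the flooded fraction of the disk centred on that pixel, with lower values near the scene border because of the zero padding.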


For SM01, IW01, IW04, and IW05, the mapping products have an overlapping
region. The multi-temporal study of the mapping results is performed in this
region. Figure 8 depicts the temporal analysis. Figure 8a shows a Moderate
Resolution Imaging Spectroradiometer (MODIS) image of Harvey on August 26, 2017.
The region for temporal analysis is illustrated as the green rectangle. The flooding
duration probability of the overlapping zone is shown in Fig. 8b. It can help us
understand the temporal evolution of the floods. Flooding at the locations with the
highest probability is likely to be the last to vanish. Pixels assigned a probability
below 0 are those for which not all of the mapping products needed for the temporal
analysis are available.

Fig. 8 Temporal analysis of SARCFMNet-generated flooding maps of IW01 (August 29), IW05
(August 30), SM01 (September 4), and IW04 (September 5). a A Moderate Resolution Imaging
Spectroradiometer (MODIS) image shows Harvey on August 26, 2017. b The flooding duration
probability of the region, which is illustrated as the green rectangle in a. c Temporal flooding
transition from IW01 to IW05 of the region, which is illustrated as the red rectangle in b. d Flooding area
proportions


Figure 8d shows the flooding area proportions of the product sequence, IW01 (August
29), IW05 (August 30), SM01 (September 4), and IW04 (September 5). It shows how
the flooded areas in the region shrink over time as the product sequence progresses.
Using regression analysis, we calculate that the shrinkage rate is around 1% of the
region area (roughly 23 km²) each day.
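
A shrinkage rate of this kind can be obtained from a least-squares line through the flooded-area proportions. The sketch below uses made-up proportions chosen only to illustrate the calculation; they are not the values behind Fig. 8d. The function name and the acquisition-time offsets in days are likewise hypothetical, and the region area of about 2300 km² is only what the statement "1% is roughly 23 km²" implies.

```python
import numpy as np


def flood_shrinkage_rate(days, flooded_fraction):
    """Least-squares slope of flooded fraction against time, sign-flipped.

    Returns the shrinkage rate as a fraction of the region area per day.
    """
    days = np.asarray(days, dtype=np.float64)
    flooded_fraction = np.asarray(flooded_fraction, dtype=np.float64)
    slope, _intercept = np.polyfit(days, flooded_fraction, deg=1)
    return float(-slope)


# Hypothetical inputs for illustration only.
days_since_first_map = [0.0, 1.5, 6.0, 7.0]
flooded_fraction = [0.100, 0.085, 0.040, 0.030]
rate_per_day = flood_shrinkage_rate(days_since_first_map, flooded_fraction)

# Area implied by the text: 1% of the region is roughly 23 km^2.
region_area_km2 = 2300.0
area_lost_per_day_km2 = rate_per_day * region_area_km2
```

With these placeholder numbers the fitted rate is 1% of the region per day, that is, about 23 km² per day for a region of 2300 km².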

We discover a phenomenon of delayed flooding after examining the product sequence.
One area does not show flooding in IW01 (August 29) but shows flooding in IW05
(August 30). In Fig. 8b, this region is marked with a red rectangle. In Fig. 8c, the
temporal flooding transition from IW01 to IW05 is examined. The region is in Glen
Flora, Texas. We checked the news [11] and found that the Colorado River (Texas)
began flowing through and across the region on the evening of August 29 (local
time). This is why the flooding is captured by IW05 (sensing time: 12:22 UTC,
August 30) but not by IW01 (sensing time: 00:26 UTC, August 29). The reason for
the delayed flooding deserves further study.


4.3 Discussions of Performance 


In this part, we discuss the performance of the proposed model in two aspects: first, 
the influence of wind; and second, cost-sensitive losses to adjust the performance. 

The influence of wind is a factor seldom discussed in flooding mapping. However, 
in the case of storm-induced coastal flooding mapping, this is a practical issue. In 
order to map coastal flooding, we may encounter the following scenario: the storm 
has already generated coastal flooding, which is captured by remote sensing data; 
nevertheless, the storm has not yet left the scene and is still delivering strong winds. 
In this case, the wind can have adverse effects on inundation mapping, since the 
strong wind increases the water areas’ backscattering. In this study, we also face 
this situation. Strong wind influences the IW01’s post-event image, as described in 
Sect. 2.1. We use a toy example to demonstrate the impact of wind on the performance 
of the DCNN model. 

We choose 400 samples from IW01 and IW03 to create IW01_selected and
IW03_selected. Then, using the DUAL+Diff architecture, we train two models on
IW01_selected and IW03_selected, and test them on IW01 and IW03. There are
four scenarios here: (1) IW01_selected training, IW01 testing; (2) IW01_selected
training, IW03 testing; (3) IW03_selected training, IW01 testing; (4) IW03_selected
training, IW03 testing. Table 3 contains the results. The column names in this table
denote the training subsets, whereas the row names denote the testing subsets. The
numbers in each block indicate classification accuracy, recall, and precision. On
the table’s diagonal, the training scene is the same as the testing scene, and the
performance is excellent, as expected. Off the diagonal, performance drops sharply.
The model trained on IW01_selected and tested on IW03 shows low precision. From the
information above, we know that the post-event image of IW01_selected is impacted by
strong wind. In this situation, the subset IW01_selected teaches the model to look
for flooded areas with higher backscattering in the post-event image. This causes
false positive predictions and lower precision when the model is tested on IW03. By
similar logic, we can understand why the model trained on IW03_selected and tested
on IW01 shows low recall.

Table 3 Toy experiment results of wind influence
Trained on IW01_selected Trained on IW03_selected
(Accuracy, Recall, Precision) (Accuracy, Recall, Precision)
Tested on IW01
Tested on IW03

Based on the explanations from the toy example, we can better understand the
overall performance evaluation listed in Table 2. We make two observations.


1. The model trained on VH polarization performs better than that trained
on VV polarization. A possible explanation is that VH is less sensitive to wind
conditions.

2. Because the S2000 dataset is created from data acquired under different wind
conditions, the model trained on it performs in a more balanced way. However, the
results, particularly those tested on IW01 and IW03, still show the impact of wind.
This tells us that, in future research, DCNN models should be aware of the
wind conditions. This is a direction for further improving performance.


a. Since VH is less sensitive to wind conditions, the model could use only VH
polarization. However, we cannot deny that VV has its own advantages for
flooding mapping, so this design may lose much information.

b. The wind information could be input together with the image information, so that
the dual-polarization information fusion is realized in a more flexible way.


In the deep learning-based paradigm for image understanding, loss functions
play an important role. They define the training objective, driving the predictions
toward the targets, with closeness measured by the loss. In this study, the BCE loss
is suitable for binary classification. The BCE loss can be adjusted according
to user-defined costs, and the performance changes accordingly. We use the
toy experiments of two models with the DUAL+Diff design, one trained on
IW01_selected and tested on IW01, and one trained on IW03_selected and tested on IW03.

The BCE loss is used first, and the results are shown in Table 4. The numbers in 
each block are classification accuracy, recall, and precision. From the first row, BCE 
is capable of balancing accuracy and recall. 
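As a reminder of how the three numbers in each block of Table 4 relate to one another, the following minimal sketch computes accuracy, recall, and precision from binary labels and predictions. The function name `binary_metrics` and the six-pixel example are hypothetical and are not taken from the experiments.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, recall, precision) for flat 0/1 sequences (1 = flooding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # flooding found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-flooding kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed flooding
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision


# Hypothetical six-pixel example: one missed flooding pixel, one false alarm
labels = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0]
acc, rec, prec = binary_metrics(labels, predictions)
```

False positives lower precision and false negatives lower recall, which is why wind-induced false alarms appear as a precision drop in the toy example.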

In real applications, users may have specific needs, such as higher recall or higher
precision. These needs can be understood as cost-defined requests. If users believe
that the cost of low recall is very high, the model must improve recall at the price of
precision. By similar logic, if users believe that the cost of poor precision is very
high, the model must improve precision at the price of recall. To meet these
requirements, cost-sensitive losses are utilized. The type-defined weighted
α-balanced BCE (αBBCE) loss [17] is one technique to build cost-sensitive losses:


$$
L_{\alpha\mathrm{BBCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \alpha\, y_i \log \hat{y}_i + (1 - \alpha)(1 - y_i) \log (1 - \hat{y}_i) \right] \tag{1}
$$

where the label of the ith pixel is denoted as $y_i$; the prediction is denoted as
$\hat{y}_i$; the number of pixels is denoted as N; and the weight is denoted as
α ∈ [0, 1]. A higher α gives more weight to the accuracy of flooding pixels during
training. Given that the number of flooding pixels is significantly smaller than the
number of non-flooding pixels, a larger value of α is appropriate. The value of α in
this experiment is 0.8. The αBBCE loss is effective, as seen in the second row of
Table 4. Because the accuracy of flooding is given more weight during training,
recall is increased at the price of precision.

Table 4 Toy experiment results of cost-sensitive losses

Columns: (1) trained on IW01_selected, tested on IW01; (2) trained on IW03_selected,
tested on IW03. Each entry lists accuracy, recall, and precision.

Rows:
BCE
αBBCE (α = 0.8): weight introduced into BCE
Fβ (β = 2): directly use F1-score loss with weight
Fβ (β = 0.5): directly use F1-score loss with weight
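The α-balanced BCE loss of Eq. (1) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `alpha_balanced_bce` and the clipping constant `eps` are assumptions added here to keep the logarithm finite, and a real training setup would use the tensor operations of a deep learning framework instead.

```python
import math


def alpha_balanced_bce(y_true, y_prob, alpha=0.8, eps=1e-7):
    """Mean alpha-balanced BCE over N pixels, following Eq. (1).

    y_true: 0/1 labels (1 = flooding); y_prob: predicted flooding probabilities.
    eps is an assumed clipping constant, not part of Eq. (1).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += alpha * y * math.log(p) + (1.0 - alpha) * (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# With alpha = 0.8, an equally confident mistake costs four times more on a
# flooding pixel (missed flooding) than on a non-flooding pixel (false alarm).
missed_flooding = alpha_balanced_bce([1], [0.1])
false_alarm = alpha_balanced_bce([0], [0.9])
```

The ratio between the two penalties is α / (1 − α), which is how a larger α pushes the model toward higher recall at the price of precision.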

Another technique to build a cost-sensitive loss is to utilize the Fβ score directly:


$$
L_{F_\beta} = 1 - \underbrace{(1 + \beta^2)\, \frac{P \cdot R}{\beta^2 P + R}}_{F_\beta\ \text{score}} \tag{2}
$$


where R and P are recall and precision, respectively, and β is a positive real weight.
Minimizing the Fβ loss increases the Fβ score. If β is greater than 1, optimizing
recall receives more attention during training. If β is less than 1, precision
optimization is given more attention. This is clearly a more direct way of controlling
recall and precision in the results according to their importance. The third and fourth
rows of Table 4 show the results of the Fβ loss with β = 2 and β = 0.5, respectively.
The results confirm that the Fβ loss is an effective way of adjusting recall and
precision according to their importance: (1) for β = 2, recall is increased at the price
of precision; (2) for β = 0.5, precision is increased at the price of recall.

Here, we use toy examples to show the design of cost-sensitive losses. Two points
should be kept in mind when designing these losses.


1. There is a performance tradeoff between recall and precision (there is no free 
lunch). 

2. One more hyper-parameter must be pre-defined. This additional hyper-parameter
gives us more control over the performance.
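The Fβ loss of Eq. (2) and the tradeoff listed above can be illustrated with a short sketch. It assumes "soft" counts of true positives, false positives, and false negatives computed from predicted probabilities, a common way to make the score differentiable; the chapter does not state how P and R are computed during training, so this choice, the name `f_beta_loss`, and the constant `eps` are assumptions.

```python
def f_beta_loss(y_true, y_prob, beta=1.0, eps=1e-7):
    """1 - F_beta, following Eq. (2), with soft counts (an assumed formulation)."""
    tp = sum(y * p for y, p in zip(y_true, y_prob))        # soft true positives
    fp = sum((1 - y) * p for y, p in zip(y_true, y_prob))  # soft false positives
    fn = sum(y * (1 - p) for y, p in zip(y_true, y_prob))  # soft false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall + eps)
    return 1.0 - f_beta


# Hypothetical four-pixel example with two candidate predictions
labels = [1, 1, 0, 0]
recall_heavy = [1.0, 1.0, 1.0, 0.0]     # finds all flooding, one false alarm
precision_heavy = [1.0, 0.0, 0.0, 0.0]  # no false alarm, misses one flooding pixel

prefers_recall = f_beta_loss(labels, precision_heavy, beta=2.0) > f_beta_loss(labels, recall_heavy, beta=2.0)
prefers_precision = f_beta_loss(labels, recall_heavy, beta=0.5) > f_beta_loss(labels, precision_heavy, beta=0.5)
```

With β = 2 the recall-heavy prediction receives the lower loss, and with β = 0.5 the precision-heavy prediction does, so the single hyper-parameter β selects which side of the tradeoff is favoured.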


5 Application Case in Bangladesh 


Bangladesh is a participating country in the Belt and Road Initiative, and it is one of 
the UN-defined least developed countries. Bangladesh is located on the coast of the 
Indian Ocean and has a low-lying terrain. Under the influence of the rainy season 
and tropical cyclones, severe flooding occurs every summer, especially from June to 
October. Flooding poses a huge threat to the safety of people’s lives and property in 
the country, and has become an obstacle to the country’s development. This chapter 
uses the SARCFMNet model to carry out a nation-level, multi-year, high-temporal- 
resolution flooding mapping of Bangladesh from 2016 to 2020. This can deepen 
our understanding of the flooding mechanism in Bangladesh, and provide powerful 
technology and data support for disaster mitigation and flood forecasting. 

In order to provide the nation-level, multi-year, high-temporal-resolution flood- 
ing mapping products for Bangladesh from 2016 to 2020, we use the following 
processing for Sentinel-1 data based on the preprocessing introduced in Sect. 2.2. 


1. For each year, we select images from February to March of that year to put
together a nation-level pre-event image.

2. For each year, for each time window in each month from June to October, we
select images to put together a nation-level post-event image.

3. The SARCFMNet model trained on the S2000 dataset is applied to the image
pairs to get the nation-level flooding mapping results.


From the aforementioned steps, we provide 45 nation-level flooding mapping 
results: 


1. Year 2016: From June to October, one nation-level flooding map is provided every 
month. 

2. Year 2017: From June to October, two nation-level flooding maps are provided 
every month, the first and second halves of the month. 

3. Year 2018: From June to October, two nation-level flooding maps are provided 
every month, the first and second halves of the month. 

4. Year 2019: From June to October, two nation-level flooding maps are provided 
every month, the first and second halves of the month. 

5. Year 2020: From June to October, two nation-level flooding maps are provided
every month, the first and second halves of the month.

In 2016, the temporal resolution of the Sentinel-1 SAR data is relatively low, so
there is only one nation-level flooding map per month. From 2017 to 2020, there are
two nation-level flooding maps per month.
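The count of 45 products follows directly from the schedule above (5 maps in 2016 plus 4 × 10 maps in 2017 to 2020). The sketch below enumerates them; the label format `year-month-suffix` is hypothetical, with the suffixes 00, 01, and 02 borrowed from the x-axis convention of Fig. 10.

```python
# Maps per month in each year, as stated in the text
maps_per_month = {2016: 1, 2017: 2, 2018: 2, 2019: 2, 2020: 2}
months = ["06", "07", "08", "09", "10"]  # June to October

products = []
for year, count in maps_per_month.items():
    # "00": one map for the whole month; "01"/"02": first/second half of the month
    suffixes = ["00"] if count == 1 else ["01", "02"]
    for month in months:
        for suffix in suffixes:
            products.append(f"{year}-{month}-{suffix}")

n_products = len(products)  # 5 + 4 * 10 = 45
```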

The flooding maps have the following characteristics:


• Spatial extent: Bangladesh
• Temporal extent: 2016-2020
• Spatial resolution: 3 arcseconds, consistent with the main world-level DEM products
• Temporal resolution: half a month (for 2016, a month)


Fig. 9 Bangladesh nation-level flooding occurrence probability maps from 2016 to 2020. The
flooding occurrence probability map is generated from the flooding maps of each year from June
to October


Based on the flooding maps, we first analyze the flooding occurrence probability
of each year, which is shown in Fig. 9. The flooding occurrence probability map is
generated from the flooding maps of each year from June to October. The maps show
that the spatial distribution of high flooding occurrence probability is relatively
stable from year to year. In each year, however, some areas are flooded that are not
flooded in other years. Based on the products, these phenomena can be analyzed case
by case.
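One plausible way to build such an occurrence probability map is to take, for every pixel, the fraction of that year's flooding maps in which the pixel is classified as flooded. The chapter does not give the exact formula, so the sketch below is an assumption; `occurrence_probability` and the 2 × 2 example are hypothetical.

```python
import numpy as np


def occurrence_probability(flood_maps):
    """Per-pixel fraction of maps flagged as flooded.

    flood_maps: sequence of T binary maps (1 = flooded), each of shape (H, W).
    """
    stack = np.asarray(flood_maps, dtype=float)  # shape (T, H, W)
    return stack.mean(axis=0)                    # shape (H, W), values in [0, 1]


# Hypothetical example: three maps of one flood season on a 2 x 2 grid
season_maps = [
    [[1, 0], [1, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
]
probability = occurrence_probability(season_maps)
```

Pixels that are flooded in every map receive a probability of 1, which corresponds to the stable high-probability areas discussed above.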

Based on the flooding maps, we then analyze the flooding extent each year, which 
is shown in Fig. 10. From this analysis, we can get the following information: 


1. We already know that flooding in Bangladesh mainly happens from June to
October due to the rainy season and tropical cyclones. From the yearly flooding
extent from 2016 to 2020, we can narrow the most severely flooded time window
down to the period from the second half of July to the first half of August.

2. For each year from 2016 to 2020, the peak flooding area is around 2 × 10⁴ km².
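Converting a flooding map on a 3-arcsecond grid into an area in km² requires the ground size of each pixel, which shrinks with latitude. The sketch below shows one way to do this under stated assumptions: a regular latitude/longitude grid with rows ordered from north to south, a spherical Earth with a mean radius of 6371 km, and a hypothetical function name and test mask. It is not the procedure used to produce Fig. 10.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def flooded_area_km2(mask, lat_north_deg, pixel_deg=3.0 / 3600.0):
    """Approximate flooded area of a binary mask on a regular lat/lon grid.

    mask: (rows, cols) array with 1 = flooded, rows ordered north to south.
    lat_north_deg: latitude of the northern edge of the grid (assumed input).
    """
    mask = np.asarray(mask, dtype=float)
    n_rows = mask.shape[0]
    # Latitude at the centre of each row
    row_lat = lat_north_deg - (np.arange(n_rows) + 0.5) * pixel_deg
    step = np.deg2rad(pixel_deg)
    # Pixel area = (R * dlat) * (R * dlon * cos(lat))
    pixel_area = (EARTH_RADIUS_KM * step) ** 2 * np.cos(np.deg2rad(row_lat))
    return float((mask.sum(axis=1) * pixel_area).sum())


# Hypothetical fully flooded 1000 x 1000 tile whose northern edge lies at 24 N
area = flooded_area_km2(np.ones((1000, 1000)), lat_north_deg=24.0)
```

At the latitude of Bangladesh one 3-arcsecond pixel covers somewhat less than 0.01 km², so a peak extent of about 2 × 10⁴ km² corresponds to a few million flooded pixels.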


The above two analyses merely illustrate the usefulness of the provided flooding
maps. Using the nation-level, multi-year, high-temporal-resolution flooding mapping
products of Bangladesh from 2016 to 2020, we can perform more targeted spatial and
temporal analyses. Hopefully, this can deepen our understanding of the flooding
mechanism in Bangladesh, and provide powerful information support for disaster
mitigation and flood forecasting.

Fig. 10 Bangladesh nation-level flooding extent area from 2016 to 2020. For 2016, the 00 after the
month in the x-axis labels means that there is one flooding map product for that month. From 2017
to 2020, the 01 after the month denotes the product for the first half of the month, and 02 denotes
the product for the second half


6 Conclusions


The SARCFMNet model, which mines multi-temporal and dual-polarimetric SAR data
for coastal inundation classification, is presented in this chapter. SARCFMNet is
built on U-Net, a benchmark deep learning model for pixel-level classification, which
we have modified for the challenges of coastal inundation mapping from SAR
imagery: (1) radar remote sensing physics-driven input information design; and
(2) regularization suitable for fully convolutional networks. We present two case
studies in this chapter. First, SARCFMNet is trained and evaluated using a dataset
derived from Houston, Texas, under the influence of Hurricane Harvey in 2017. Six
image pairs, with ground truth delineated by human annotators with the help of
Google Earth and OpenStreetMap, are used to test the proposed SARCFMNet model.
The average mapping accuracy and F1 score are 0.98 and 0.88, respectively, better
than those of the benchmark deep learning model for pixel-level classification. This
verifies the usefulness of the proposed designs. The geospatial study of Harvey-caused
floods is performed using the flooding predictions and indicates Harvey’s massive
impact on agriculture. The multi-temporal study estimates the rate at which the
flooding decreases and uncovers a delayed-inundation phenomenon. Second, the
trained and verified SARCFMNet model is applied to Bangladesh, one of the
UN-defined least developed countries, to produce nation-level, multi-year,
high-temporal-resolution flooding maps. The flooding maps of Bangladesh cover 2016
to 2020, with a spatial resolution of 3 arcseconds and a temporal resolution of half a
month (for 2016, a month). This can help us gain a deeper understanding of the
flooding mechanism of this country. In addition, the impact of meteorological factors
on DCNN-based flooding mapping models and cost-sensitive losses are discussed. We
propose that this model can be easily and readily generalized to other multi-temporal
ocean remote sensing imagery information mining problems.


1 Introduction


The changes in global sea ice volume, distribution, and movement reflect the
interaction of the atmosphere, cryosphere, and hydrosphere and global climate
change [30]. Sea ice study is also significant because sea ice raises safety concerns for
marine navigation and transportation. Since the classification of sea ice and open
water provides valuable information for safe navigation, sea ice classification and
monitoring draw extensive attention [8, 37, 39]. Satellite remote sensing, using
sensors such as optical cameras, microwave radiometers, and synthetic aperture radar
(SAR), has been the most effective way to monitor sea ice in the polar regions
[21, 40]. SAR images have been the primary source for sea ice classification and
monitoring due to their high spatial resolution, wide coverage, and ability to
penetrate clouds [7].

A series of studies have been devoted to classifying sea ice and open water in SAR
images, including threshold-based methods, expert systems, and machine learning
methods. The Multi-Year Ice (MYI) Mapping System (MIMS) is a typical
threshold-based model, and it can quickly map MYI in uncalibrated SAR images [13].
A representative expert system is the Advanced Reasoning using Knowledge for
Typing Of Sea ice (ARKTOS) [38]. ARKTOS performs a fully automated analysis
of SAR sea ice images by mimicking the reasoning process of sea ice experts. Among
machine learning methods, the regression model is an early exploration. Lundhaug
and Maria [29] proposed a multivariate regression method to model the relationship
between the mean and standard deviation of the backscattering coefficients and air
temperatures with sea ice types and water. Their experiments showed that the
correlation coefficients between predicted and actual values were higher than 0.90.
Karvonen [20] developed a modified pulse-coupled neural network (PCNN) to
classify the sea ice in the Baltic Sea. Zhang et al. [47] proposed a k-means-based
model that combines microwave scatterometer and radiometer data to classify sea ice
types. Zakhvatkina [45] extracted textural features from the gray-level co-occurrence
matrix (GLCM) and input the features into an artificial neural network (ANN)-based
model to classify sea ice and open water. Similarly, researchers combined the GLCM
with other machine learning algorithms, such as the Markov random field (MRF) [6]
and the support vector machine (SVM) [25], to classify sea ice from SAR images.

Overall, the main drawback of the aforementioned traditional methods is that 
they need prior expert knowledge and sophisticated manual engineering to extract 
features for discriminating between sea ice and open water. This drawback has been 
a common challenge faced by the earth system science in the era of big data [33]. 

Deep learning (DL) technology addresses this challenge [19]. A typical DL model
consists of deep neural networks (DNNs), which accept input data in a raw format
and automatically discover the required features [24]. In recent years, DL has been
successfully applied in oceanography, geography, and remote sensing, which has
helped humans gain further process understanding of earth system science problems
[27, 32-34, 43]. A deep convolutional neural network (CNN) is a particular type of
DNN composed of CNN layers. A CNN layer connects to the local patches of the
previous layer through convolution kernels to extract local spatial features [22].
Since CNN-based methods have achieved great success in image classification,
researchers have employed CNNs to extract features automatically to improve the
accuracy and efficiency of sea ice classification. Yan and Scott [44] introduced an
early CNN-based model, AlexNet [2], and transfer learning to classify sea ice and
open water. Li et al. [26] proposed a CNN-based model to classify sea ice and open
water from Chinese Gaofen-3 SAR images. Wang et al. [42] constructed a CNN model
consisting of three CNN layers and two fully connected neural network layers to
classify sea ice near the Bering Strait. The authors of [16] integrated transfer learning
and dense CNN blocks to form a transferred multilevel fusion network (MLFN). The
MLFN outperformed the PCAKM [5], the NBRELM [15], and the GaborPCANet [12]
in classifying sea ice and open water.

More and more researchers are trying to construct DL-based models to achieve
end-to-end classification of sea ice and open water. Although the aforementioned
DL-based models deliver excellent performance, several issues remain. First,
classification accuracy needs to be further improved. Especially for medium- to
high-resolution SAR images, fine-grained objects such as small floes, sinuous
ice-water boundaries, and ice channels need to be well classified. Second, the
information in SAR images, such as dual-polarization information and the incident
angle (IA), is not fully utilized by most DL-based models. The benefit of fusing
dual-polarized information has been demonstrated with a conventional method [25],
and the IA affects the radar backscattering intensity. All this information should be
considered to improve classification accuracy. Third, most of the existing models are
validated on independent images, and their applicability to more challenging tasks,
such as classifying a series of images from freezing to melting, remains to be verified.

Aiming to solve the issues mentioned above, we propose a dual-attention U-Net
model, DAU-Net, to classify sea ice and open water in SAR images. U-Net was
initially developed for the semantic segmentation of biomedical images [35]. It is
designed to work with fewer training samples while still yielding precise
segmentations. The effectiveness of employing U-Net to solve classification or
segmentation problems in geoscience has been demonstrated [11, 28, 46]. Therefore,
we use U-Net as the backbone of the classification model. The dual-polarized
information and the IA of SAR images are utilized as the model inputs. To extract
more characteristic features from the multiple inputs, we integrate the dual-attention
mechanism [14] to optimize the original U-Net. Finally, we use SAR images near the
Bering Sea to train and evaluate the model. We validate the applicability of DAU-Net
on a series of SAR images of the Bering Strait and compare the classification results
with the sea ice products of the National Snow and Ice Data Center (NSIDC).


2 Data 


2.1 Study Area 


The study areas are the Bering Sea and the Bering Strait, which are located near the
outer edge of the sea ice on the Pacific side of the Arctic (Fig. 1). The Bering Strait is
the only channel for water exchange between the Pacific Ocean and the Arctic Ocean.
It shows strong atmosphere-sea-ice interactions and supports one of the world’s most
productive and valuable fisheries, with ever-increasing commercial vessel activities
[9]. Therefore, sea ice detection and monitoring in this region are of great interest to
scientific research communities and commercial fishing and transportation industries.


2.2 SAR Images 


The SAR images are obtained from Sentinel-1A in the interferometric wide-swath
(IW) mode with a swath width of 250 km. The images are ground range detected
(GRD) products with VV + VH (vertical transmitting with vertical and horizontal
receiving, respectively) polarizations. The IA ranges from 30.00° to 46.00°. The
range and azimuth resolutions are 5 and 20 m, respectively, with a pixel spacing
of 10 m.


Fig. 1 The location of the study area


The data set consists of 34 SAR images, as shown in Table 1, and is divided into
three subsets: (1) the training set, (2) the testing set, and (3) the applicability
validation set (Fig. 1). The training set includes 15 images (No. 1 to No. 15 in
Table 1). The testing set is image No. 16 in Table 1, which we use to evaluate the
model performance with metrics. The applicability validation set is a series of
images covering the Bering Strait. The series contains six images, each of which is
mosaicked from three single Sentinel-1A images, for a total of 18 Sentinel-1A images
(No. 17 to No. 34 in Table 1). The image series covers the whole process from
freezing to melting in the Bering Strait. Therefore, we can validate the applicability
of the well-trained model by monitoring the entire sea ice cycle in the Bering Strait.


2.3 NSIDC Sea Ice Products 


The sea ice products of the NSIDC [41], named Multisensor Analyzed Sea Ice Extent 
- Northern Hemisphere (MASIE-NH), are employed as a reference for the applica- 
bility discussion. The product is based on the Interactive Multisensor Snow and Ice 
Mapping System (IMS) results produced by the National Ice Center (NIC). NIC uti- 
lizes visible imagery, passive microwave data, and NIC weekly analysis products to 
create their data product. MASIE-NH provides measurements of daily sea ice extent 
and sea ice edge boundary for the Northern Hemisphere and 16 Arctic regions in a 
polar stereographic projection at both 1 and 4 km grid cell sizes [41]. We choose the 
1 km MASIE-NH products as the reference.


Table 1 Information of the SAR images

No.  Imaging date (dd/mm/yyyy)  Center location    Function
1    13/12/2018                 169.13W, 63.21N    Training
2    25/12/2018                 169.78W, 61.74N    Training
3    25/12/2018                 169.13W, 63.21N    Training
4    06/01/2019                 169.78W, 61.74N    Training
5    18/01/2019                 169.78W, 61.74N    Training
6    04/02/2019                 171.77W, 61.87N    Training
7    06/02/2019                 167.82W, 61.43N    Training
8    11/02/2019                 169.78W, 61.74N    Training
9    14/03/2019                 167.19W, 62.93N    Training
10   14/03/2019                 167.84W, 61.48N    Training
11   19/03/2019                 169.78W, 61.74N    Training
12   24/03/2019                 171.77W, 61.87N    Training
13   26/03/2019                 167.20W, 62.93N    Training
14   31/03/2019                 170.35W, 60.25N    Training
15   24/04/2019                 169.13W, 63.21N    Training
16   24/04/2019                 169.78W, 61.74N    Testing
17   13/12/2018                 166.90W, 67.66N    Applicability validation
18   13/12/2018                 167.72W, 66.17N    Applicability validation
19   13/12/2018                 168.42W, 64.70N    Applicability validation
20   25/12/2018                 166.90W, 67.66N    Applicability validation
21   25/12/2018                 167.72W, 66.17N    Applicability validation
22   25/12/2018                 168.42W, 64.70N    Applicability validation
23   31/03/2019                 166.90W, 67.66N    Applicability validation
24   31/03/2019                 167.72W, 66.17N    Applicability validation
25   31/03/2019                 168.42W, 64.70N    Applicability validation
26   12/04/2019                 166.90W, 67.66N    Applicability validation
27   12/04/2019                 167.72W, 66.17N    Applicability validation
28   12/04/2019                 168.42W, 64.70N    Applicability validation
29   24/04/2019                 166.90W, 67.66N    Applicability validation
30   24/04/2019                 167.72W, 66.17N    Applicability validation
31   24/04/2019                 168.42W, 64.70N    Applicability validation
32   06/05/2019                 166.90W, 67.66N    Applicability validation
33   06/05/2019                 167.72W, 66.17N    Applicability validation
34   06/05/2019                 168.42W, 64.70N    Applicability validation


2.4 Data Preprocessing


We use SNAP 3.0 [10] to perform radiometric calibration and boxcar filtering on
all SAR images. As the source SAR images are too large, we downscale each image
to 1/3 of its original size, to about 8,000 × 5,000 pixels. Although the pixel spacing
is thereby coarsened from 10 m to 30 m, the spatial resolution is still much higher
than that of the MASIE-NH products (1 km). It is far more detailed than could be
expected from existing manual or operational automatic classifiers [25]. We scale all
pixel values to 0-1. All IA values are scaled to 0-1 with reference to the range 0°-90°.

The SAR images are labeled into two classes, 1 for sea ice and 0 for open water,
with the annotation tool LabelMe [36] to obtain the ground truth labels. As the
resolutions of existing sea ice products are much lower than that of the Sentinel-1A
images [25], the labeling process is based on visual interpretation. For regions that
are difficult to distinguish, we refer to the 1 km MASIE-NH products to label them.
In this way, most of the pixels in the SAR images can be labeled correctly. Due to
the limitations of SAR image noise and manual labeling, there are inevitably a few
mislabeled pixels, and some small sea ice objects cannot be accurately labeled.
This is a common problem in supervised learning. For most classification tasks,
such mislabeled pixels account for a small proportion of all pixels and do not affect
the convergence of the model [17].

We divide all images (VV, VH, and IA) into 256 × 256-pixel chips as the model
inputs. Fig. 2 takes the VV channel as an example to show the SAR image chips and
the corresponding ground truth labels.
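The scaling and chipping steps described in this section can be sketched with NumPy as follows. Several details are assumptions made for illustration: the chapter does not say how pixel values are scaled to 0-1 (min-max scaling is used here), incomplete tiles at the image edges are simply dropped, and all function names and the random test image are hypothetical.

```python
import numpy as np


def min_max_scale(band):
    """Scale one band to 0-1 (assumed min-max scaling)."""
    band = np.asarray(band, dtype=float)
    lo, hi = band.min(), band.max()
    return (band - lo) / (hi - lo) if hi > lo else np.zeros_like(band)


def scale_incidence_angle(ia_deg):
    """Scale incidence angles to 0-1 with reference to the range 0-90 degrees."""
    return np.asarray(ia_deg, dtype=float) / 90.0


def tile_chips(image, chip=256):
    """Cut an (H, W, C) image into (n, chip, chip, C) chips, dropping edge remainders."""
    h, w, c = image.shape
    rows, cols = h // chip, w // chip
    image = image[: rows * chip, : cols * chip]
    chips = image.reshape(rows, chip, cols, chip, c).swapaxes(1, 2)
    return chips.reshape(rows * cols, chip, chip, c)


# Hypothetical 512 x 768 scene with VV, VH, and IA channels
rng = np.random.default_rng(0)
vv = min_max_scale(rng.normal(size=(512, 768)))
vh = min_max_scale(rng.normal(size=(512, 768)))
ia = scale_incidence_angle(np.linspace(30.0, 46.0, 768)[None, :].repeat(512, axis=0))
stack = np.stack([vv, vh, ia], axis=-1)  # shape (512, 768, 3)
chips = tile_chips(stack)                # shape (6, 256, 256, 3)
```

The label images would be cut with the same tiling so that every chip keeps its pixel-aligned ground truth.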


Fig. 2 Image chips (VV channel) and the corresponding labels. a-h SAR image chips of
256 × 256 pixels. i-p Labels corresponding to the SAR image chips in a-h


3 Method


3.1 Overall Structure of DAU-Net


The backbone of the proposed DAU-Net is a U-Net model. U-Net is named for its
almost symmetric, U-shaped encoder-decoder architecture and is designed to work
with fewer training samples while still yielding precise segmentations. The encoder
extracts abstracted, downscaled high-level feature maps. The decoder restores the
resolution of the high-level feature maps. The intermediate feature maps extracted by
the encoder and decoder are connected to form multi-scale feature maps for
pixel-level classification. The encoder can be a mature DNN model, such as VGG16,
ResNet18, or ResNet34 [4, 23].

Discriminative feature representations are essential for improving classification
accuracy. To achieve highly accurate classification of sea ice and open water in
medium- to high-resolution SAR images, we need more characteristic features to
discriminate fine-grained objects such as small floes, sinuous ice-water boundaries,
and ice channels. Therefore, we integrate a dual-attention mechanism into the original
U-Net to form the DAU-Net model, which improves the feature representations of sea
ice and open water. The dual self-attention mechanism comprises a position attention
module (PAM) and a channel attention module (CAM), which capture long-range
dependencies in the spatial and channel dimensions, respectively. It has been
demonstrated to be effective in classical image segmentation [14].

The PAM captures long-range dependencies in the spatial dimension with a
self-attention mechanism. For a feature map, the feature value at a specific position
is updated by aggregating the feature values at all positions with a weighted
summation. The weights are determined by the feature similarities between the
corresponding two positions. Any two positions with similar features can contribute
to mutual improvement regardless of their distance in the spatial dimension.
Similarly, the CAM employs the self-attention mechanism to capture the channel
dependencies between any two channel maps. Each channel map is updated by a
weighted sum of all channel maps. Finally, the outputs of these two attention modules
are fused to further enhance the feature representations.

Overall, as shown in Fig. 3, the DAU-Net consists of five parts: input, encoder,
attention, decoder, and output. Each input unit consists of three channels of a
256 × 256-pixel SAR image: VV, VH, and IA. The encoder is ResNet-34, a mature
model for image recognition, and it extracts abstracted, downscaled feature maps for
accurate classification. The attention part performs position attention and channel
attention on the extracted feature maps to capture long-range dependencies in the
spatial and channel dimensions. The outputs of the two attention modules are fused
to form more characteristic features, which are transmitted to the decoder. The
decoder module rescales the downscaled feature maps to the original size. Skip
connections link the encoder features and decoder features. Next, we detail the
encoder, attention, decoder, and output modules.


Fig. 3 Model design. a The model’s input: VV, VH, and IA channels. b The model’s encoder.
c The attention modules. d The model’s decoder. e The model’s output


3.2 Encoder 


He et al. [18] proposed the residual network (ResNet) to increase the number of
hidden CNN layers to more than one hundred. The ResNet family includes ResNet-18,
ResNet-34, ResNet-50, and ResNet-101, where the number represents the number of
CNN layers. A larger number means more CNN layers, more parameters, and higher
training complexity. The ResNet family has been widely used in semantic
segmentation and object detection. Considering the depth of the model, the number
of trainable parameters, and the complexity of sea ice texture, we choose ResNet-34
as the encoder for DAU-Net. The comparisons between the ResNet-34 and the other
ResNet-based encoders are carried out in the F part of Section IV.

The encoder consists of the 33 CNN layers of ResNet-34, organized into five stages.
The first stage is one CNN layer with a 7 × 7 kernel and a stride of 2 × 2. After the
first stage, the original image size is downscaled to 128 × 128. The remaining four
stages are composed of 3, 4, 6, and 3 ResNet blocks, for a total of 16 ResNet blocks
(Fig. 3). Each ResNet block contains two stacked CNN layers with a shortcut
connection linking the input of the block and the output of the second CNN layer
[18]. The numbers of convolutional kernels in the five stages are 64, 64, 128, 256,
and 512. The original ResNet-34 model uses four 2 × 2 max-pooling layers that are
stacked on the four ResNet stages to downscale the feature maps. Here, we discard
the last max-pooling layer and retain the first three. The activation function of each
CNN layer is ReLU [1]. After encoding, the original inputs are transformed into 512
feature maps of 16 × 16 pixels. These high-level features are then transmitted to the
attention part.


3.3 Attention


The 512 feature maps of 16 × 16 pixels extracted by the encoder are fed into the
PAM and CAM to capture spatial and channel dependencies. The outputs of these two
attention modules are fused and transmitted to the decoder.
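The 16 × 16 size of the feature maps entering the attention modules follows from the downsampling schedule given in Sect. 3.2: a first stage with stride 2, followed by the three retained 2 × 2 max-pooling layers. The short sketch below reproduces this arithmetic; the function name and its parameters are hypothetical.

```python
def encoder_output_size(input_size=256, stem_stride=2, n_pool=3, pool_stride=2):
    """Spatial size after the stride-2 first stage and the retained pooling layers."""
    size = input_size // stem_stride  # 256 -> 128 after the first stage
    for _ in range(n_pool):           # three retained 2 x 2 max-pooling layers
        size //= pool_stride          # 128 -> 64 -> 32 -> 16
    return size


side = encoder_output_size()  # 16
n_positions = side * side     # N = H x W = 256 positions seen by the PAM
```

With N = 256 positions, the spatial attention map of the PAM has 256 × 256 entries, which keeps the attention computation inexpensive at this stage of the network.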


3.3.1 PAM 


Since CNNs adopt local connections, the features captured by CNNs are local. For
semantic segmentation, local features generated by fully convolutional networks are
not representative enough, which could lead to misclassifications [31]. The PAM
addresses this issue. The PAM updates the feature value at a specific position by
aggregating the feature values at all positions with a weighted summation. Thus, the
global spatial dependencies between any two positions can be captured. These global
features are fused with local features to form more characteristic features. We detail
the calculation of the PAM below.

As shown in Fig. 4a, let H, W, and C represent the height, width, and number of
channels, and let $A \in \mathbb{R}^{H \times W \times C}$ be a local feature map
extracted from the model inputs. The white/dark regions represent sea ice/water
features. There are some inaccurate features in A, especially in the regions marked
by the red rectangle. Then A is fed into three CNN layers to generate three feature
maps $B \in \mathbb{R}^{H \times W \times C}$, $C \in \mathbb{R}^{H \times W \times C}$,
and $D \in \mathbb{R}^{H \times W \times C}$, as shown in Fig. 4b. B is reshaped to
$B^1 \in \mathbb{R}^{N \times C}$, where N = H × W is the number of pixels. C is
reshaped and transposed to $C^1 \in \mathbb{R}^{C \times N}$. Then, matrix
multiplication is performed between $B^1$ and $C^1$, and the result is activated by a
softmax layer to calculate the spatial attention map $S \in \mathbb{R}^{N \times N}$.
The softmax activation [3] normalizes S by row so that the sum of each row is 1. The
more similar the feature representations of two positions are, the higher the
correlation between them and the larger the corresponding value in S.


Fig. 4 Flow of PAM. a Before PAM, some water pixels are misclassified. b Calculation process of
PAM. c After PAM, the misclassified pixels are corrected

The global dependencies between any two positions in the feature map are modeled
by S. D is reshaped to $D^1 \in \mathbb{R}^{N \times C}$. S is multiplied by $D^1$ to
generate $A^s \in \mathbb{R}^{N \times C}$:

$$
a_{ij} = S_i \cdot D_j, \quad j \in [1, C] \tag{1}
$$

where $a_{ij}$ is an element of $A^s$, $S_i$ is the ith row of S, and $D_j$ is the jth
column of $D^1$.

$A^s$ is reshaped to $A^1 \in \mathbb{R}^{H \times W \times C}$. For each channel of
$A^1$, the element at a position is the weighted sum of the elements across all
positions in the corresponding channel of D, based on the weights in S. Therefore,
$A^1$ has a global contextual view and selectively aggregates contexts according to
the spatial attention map. $A^1$ is multiplied by a scale parameter α and added
element-wise to the input feature map A to obtain the output
$E \in \mathbb{R}^{H \times W \times C}$:

$$
E = \alpha A^1 + A \tag{2}
$$

where α is initialized as 0 and gradually learns to assign more weight.

Each pixel value of the output feature map E is a weighted sum of the features
across all pixels plus the original feature. E thus integrates the local features and the
long-range global features. Similar semantic features achieve mutual gains, which
improves intra-class compactness and semantic consistency. Intuitively, as shown in
Fig. 4c, the inaccurate features in A are optimized by the PAM, which contributes to
the final output.
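The PAM computation of Eqs. (1) and (2) can be written compactly with NumPy. The sketch below is illustrative only: it assumes that the three CNN layers producing B, C, and D act as 1 × 1 convolutions without bias (the chapter does not state their kernel size), represented here by hypothetical weight matrices `Wb`, `Wc`, and `Wd`, and it treats α as a fixed number rather than a learned parameter.

```python
import numpy as np


def softmax_rows(x):
    """Row-wise softmax, so that every row sums to 1."""
    x = x - x.max(axis=1, keepdims=True)  # subtract row maxima for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def position_attention(A, Wb, Wc, Wd, alpha):
    """A: (H, W, C) feature map; Wb, Wc, Wd: assumed (C, C) 1 x 1 convolution weights."""
    H, W, C = A.shape
    A_flat = A.reshape(H * W, C)
    B1 = A_flat @ Wb           # (N, C), reshaped B
    C1 = (A_flat @ Wc).T       # (C, N), reshaped and transposed C
    D1 = A_flat @ Wd           # (N, C), reshaped D
    S = softmax_rows(B1 @ C1)  # (N, N) spatial attention map
    As = S @ D1                # (N, C), Eq. (1)
    A1 = As.reshape(H, W, C)
    return alpha * A1 + A, S   # Eq. (2)


# Hypothetical 4 x 4 feature map with 8 channels
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4, 8))
Wb, Wc, Wd = (rng.normal(size=(8, 8)) for _ in range(3))
E0, S = position_attention(A, Wb, Wc, Wd, alpha=0.0)  # alpha = 0 returns A unchanged
E, _ = position_attention(A, Wb, Wc, Wd, alpha=0.5)
```

Because α starts at 0, the module initially passes the encoder features through unchanged and only gradually mixes in the globally aggregated context during training.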


3.3.2 CAM 


Each channel map of high-level features can be regarded as a class-specific response,
and different semantic responses are associated with each other. The CAM updates
the feature value at a position by aggregating the feature values of all channels at the
same position with a weighted sum. The interdependencies between channels of
feature maps are captured, which improves the feature representation of specific
semantics. The structure of CAM is illustrated in Fig. 5. As shown in Fig. 5a, let H,
W, and C represent the height, width, and number of channels, and let
$A \in \mathbb{R}^{H \times W \times C}$ be a local feature map extracted from the
model inputs. The channel attention map $X \in \mathbb{R}^{C \times C}$ is calculated
from the original features A (Fig. 5b). A is reshaped to
$A^1 \in \mathbb{R}^{N \times C}$, and is reshaped and transposed to
$A^2 \in \mathbb{R}^{C \times N}$. Then, a matrix multiplication between $A^2$ and
$A^1$ is performed, and a softmax layer is applied to obtain the channel attention map
X. The more similar the feature representations of two channels are, the higher the
correlation between them and the larger the corresponding value in X. The sum of
each row in X is 1. $A^1$ is multiplied by the transpose of X to generate
$A^x \in \mathbb{R}^{N \times C}$:

$$
a_{ij} = A^1_i \cdot X_j, \quad j \in [1, C] \tag{3}
$$

where $a_{ij}$ is an element of $A^x$, $A^1_i$ is the ith row of $A^1$, and $X_j$ is
the jth row of X (that is, the jth column of its transpose). $A^x$ is reshaped to
$A^3 \in \mathbb{R}^{H \times W \times C}$. For each position of $A^3$, the element
of a channel is the weighted sum of the elements across all channels at the
corresponding position of A, based on the weights in X. Therefore, $A^3$ has
long-range contextual dependencies in the channel dimension. $A^3$ is multiplied by
a scale parameter β and added element-wise to the input feature map A to obtain the
output $F \in \mathbb{R}^{H \times W \times C}$:

$$
F = \beta A^3 + A \tag{4}
$$

where β gradually learns a weight from 0.

Fig. 5 The detailed calculation process of CAM in the DAU-Net. a Feature maps without CAM;
some water pixels are inaccurately encoded as sea ice pixels, marked by the red rectangle. b The
calculation process of CAM. c Feature maps after CAM. Some inaccurate sea ice pixels are
corrected to water pixels, improving the accuracy of the outputs

The final feature of each channel is a weighted sum of the features of all chan- 
nels and original features. The long-range semantic dependencies between different 
channels of the feature maps are modeled, which boosts feature discriminability. 
As shown in Fig. 5a, many open water regions are inaccurately represented as sea
ice features in feature map A. After the channel attention procedure, most of the
inaccurate regions in A are corrected (Fig. 5c). The output feature map F is more
discriminative than A and helps to achieve a good classification result.
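The channel attention computation described above (reshape, channel-by-channel similarity, row-wise softmax, re-weighting, and the residual sum with a scale parameter) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name, the toy input, and the fixed value of beta are assumptions, and N denotes H x W.

```python
import numpy as np

def channel_attention_output(A, beta):
    """Sketch of the CAM: returns F = beta * A4 + A and the attention map X.

    A    : feature map of shape (H, W, C)
    beta : scalar scale parameter (learned in the real model, fixed here)
    """
    H, W, C = A.shape
    A1 = A.reshape(H * W, C)                  # N x C
    A2 = A1.T                                 # C x N (reshaped and transposed)
    energy = A2 @ A1                          # C x C channel similarities
    energy = energy - energy.max(axis=1, keepdims=True)  # numerical stability only
    X = np.exp(energy) / np.exp(energy).sum(axis=1, keepdims=True)  # each row sums to 1
    A3 = A1 @ X.T                             # N x C, Eq. (3)
    A4 = A3.reshape(H, W, C)                  # back to H x W x C
    return beta * A4 + A, X                   # Eq. (4)

# Toy input (illustrative values only)
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4, 3))
F, X = channel_attention_output(A, beta=0.3)
```

Because each row of X sums to 1, every output channel is a convex combination of the input channels at the same position before the residual sum is applied.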


3.3.3 Fusion 


The PAM output and CAM output are separately transformed by a CNN layer. An 
element-wise summation is performed on the two transformed results. A CNN layer 
executes convolutions on the summation to generate fusion features. Finally, the 
fusion features are transmitted to the decoding part. 
3.4 Decoder


Five decoder modules are stacked upon the features output by the attention modules,
and each decoder module is composed of one up-sampling layer and two stacked
CNN layers. Each CNN layer is followed by a batch normalization layer and a ReLU
activation layer. The numbers of convolutional kernels in the five decoder modules are
256, 128, 64, 32, and 16, respectively. Three concatenations fuse the features generated
by the encoder and decoder at the same level. The kernel size of all CNN layers in the
decoder modules is 3 × 3. After decoding, the 16 × 16 feature maps are rescaled to the
same size as the input image, 256 × 256.


3.5 Output 


The feature maps output by the decoder are fed into the output module, which consists
of one CNN layer with a single 1 × 1 convolutional kernel. A sigmoid layer applies a
non-linear activation to the convolutional outputs to predict the value of each pixel.
The activation value lies in [0, 1]. If it is larger than 0.5, the pixel is classified as sea ice;
otherwise, it is classified as open water. The loss function is binary cross-entropy.
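The output rule and the loss can be illustrated with a small NumPy sketch. The function names and the three example pixel values are hypothetical; only the sigmoid activation, the 0.5 threshold, and the binary cross-entropy loss come from the text.

```python
import numpy as np

def sigmoid(z):
    """Map convolutional outputs to activation values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(prob, threshold=0.5):
    """1 = sea ice if the activation is larger than the threshold, else 0 = open water."""
    return (prob > threshold).astype(np.uint8)

def binary_cross_entropy(y_true, prob, eps=1e-7):
    """Mean binary cross-entropy; eps clipping avoids log(0)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob)))

# Three illustrative pixels: activations and their true labels (1 = sea ice)
prob = np.array([0.9, 0.2, 0.6])
label = np.array([1, 0, 0])

mask = classify(prob)                      # third pixel becomes a false alarm
loss = binary_cross_entropy(label, prob)
```

In this toy case the third pixel (activation 0.6, true label water) is predicted as sea ice, and it also contributes the largest term to the loss.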


4 Experiments 


4.1 Experiments Setting 


There are 4,684 SAR chips in the training set. We hold out 30% of these samples
as the validation set. We choose a typical image with a rough sea surface and various
sea ice textures as the testing image and divide it into 672 chips of 256 × 256 pixels.
The model runs on a GPU workstation with one NVIDIA Tesla V100 32 GB GPU. The batch
size is 16, and the initial learning rate is 0.0001. We use Keras as the DL package, and
its ReduceLROnPlateau and early stopping strategies are employed to accelerate
convergence and avoid overfitting.
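A possible way to express this training setup in Keras is sketched below. The batch size, initial learning rate, and validation fraction come from the text; the monitored quantity, the reduction factor, and the patience values are not given in the chapter and are purely illustrative assumptions, as are the names `TRAIN_CONFIG` and `build_callbacks`.

```python
# Values stated in the text
TRAIN_CONFIG = {
    "batch_size": 16,
    "initial_learning_rate": 1e-4,
    "validation_fraction": 0.3,
}

def build_callbacks(lr_factor=0.5, lr_patience=5, stop_patience=15):
    """Return the two Keras callbacks named in the text.

    The three default arguments are assumptions for illustration only.
    """
    # Imported lazily so the configuration can be inspected without TensorFlow installed.
    from tensorflow import keras

    reduce_lr = keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=lr_factor, patience=lr_patience
    )
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=stop_patience, restore_best_weights=True
    )
    return [reduce_lr, early_stop]
```

The returned list would be passed to `model.fit(..., batch_size=TRAIN_CONFIG["batch_size"], callbacks=build_callbacks())` for a compiled Keras model.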


4.2 Evaluation Metrics 


Accuracy, precision, recall, and mean intersection over union (IoU) are employed
to evaluate the performance of the classification methods. These metrics are defined
in Fig. 6. Accuracy refers to the proportion of correctly predicted pixels, both sea ice
and water, among all predicted pixels. Precision refers to the proportion of pixels that
are true sea ice and predicted as sea ice among all predicted sea ice pixels. A higher
precision value means the model produces fewer false alarms. Recall refers to the
proportion of pixels that are true sea ice and predicted as sea ice among all true sea ice
pixels. A higher recall value means the model misses fewer sea ice pixels. IoU is the
proportion of pixels that are true sea ice and predicted as sea ice to the union of true
sea ice and predicted sea ice pixels. When the predicted sea ice pixels coincide completely
with the true sea ice pixels, the IoU reaches its maximum value of 1.

TP: true sea ice pixels that are predicted as sea ice pixels
FP: true water pixels that are predicted as sea ice pixels
FN: true sea ice pixels that are predicted as water pixels
TN: true water pixels that are predicted as water pixels

Fig. 6 Definitions of accuracy ((TP + TN)/(TP + TN + FP + FN)), precision (TP/(TP + FP)), recall
(TP/(TP + FN)), and IoU (TP/(TP + FP + FN))
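The four metrics defined in Fig. 6 can be computed from binary masks as in the following sketch, where 1 denotes sea ice and 0 denotes open water. The function names and the eight-pixel example are hypothetical and only serve to illustrate the definitions.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for binary masks (1 = sea ice, 0 = water)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    return tp, fp, fn, tn

def segmentation_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and IoU as defined in Fig. 6."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,
    }

# Eight illustrative pixels: TP = 3, FP = 1, FN = 1, TN = 3
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
metrics = segmentation_metrics(y_true, y_pred)
```

For this example the accuracy, precision, and recall are all 0.75, while the IoU is 0.6, which shows that IoU penalizes both false alarms and misses at once.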
4.3 Comparison Experiments Against Other Models' Performances


To validate the performance of the proposed DAU-Net, two recently proposed DL-based
sea ice classification models are selected for comparison: 1) CNN_Wang, the CNN-based
detection model proposed by Wang et al. [42] in 2018, which consists of five CNN layers
and three max-pooling layers; and 2) DenseNet_FCN, which has a structure similar to that
of the MLFN model proposed in 2019 [16]. To support pixel-level segmentation and make a
fair comparison, DenseNet_FCN replaces the fully connected layers in the original MLFN
with fully convolutional layers and adds upsampling blocks, forming a “U”-shaped
segmentation model.

We also compare our model against the classic U-Net model, which has a structure
similar to that of DAU-Net except that the CAM and PAM are removed. It is worth noting
that the CNN layers after the two attention modules and the CNN layer of the fusion part
are retained to ensure a fair comparison. U-Net_CAM denotes the U-Net model with the CAM
but no PAM, and U-Net_PAM denotes the U-Net model with the PAM but no CAM. The CNN layers
are likewise retained in these two models. We tune the hyperparameters of all compared
models and record the results with the best accuracy.

The evaluation metrics of all models are shown in Table 2, and the corresponding
classification results are shown in Fig. 7. The accuracy, IoU, and precision of
CNN_Wang are lower than those of the other five models. However, the recall of
CNN_Wang is the largest. Its precision and recall are very unbalanced, which means
that CNN_Wang misses fewer sea ice pixels but misclassifies many open water pixels
as sea ice (a high false alarm rate). As shown in Fig. 7d, the classification results
of CNN_Wang, such as sea ice edges and ice blocks, are coarse-grained. Limited by
its model complexity, CNN_Wang has difficulty extracting enough representative
features to achieve fine-grained classification and thus generates many false alarms.
Compared with CNN_Wang, the accuracy, IoU, and precision of DenseNet_FCN are clearly
improved, and the recall is reduced. The gap between precision and recall is narrowed.
Fig. 7e shows that the classification results are much more refined than those of
CNN_Wang. However, there are still some false alarms in the region marked by the
red rectangle. Although DenseNet_FCN is more complicated than CNN_Wang, it still cannot
extract sufficiently characteristic features to accurately distinguish sea ice
from water, especially in areas where sea ice and water are mixed under complex sea
conditions.


Table 2 Evaluation results of all compared models

Model          Accuracy (%)   IoU      Precision   Recall
CNN_Wang       -              -        -           0.9520
DenseNet_FCN   -              -        -           0.9298
U-Net          -              -        -           0.9222
U-Net_CAM      -              -        -           0.9305
U-Net_PAM      -              -        -           0.9207
DAU-Net        94.39          0.8673   0.9355      0.9225


Fig. 7 a-c Inputs of the test SAR image. The VV channel and VH channel are scaled to 0-255 for
better visualization. d-i Classification results of different models

The U-Net model outperforms CNN_Wang and DenseNet_FCN in both accuracy
and IoU. Its recall and precision are also more balanced. Fig. 7f shows that the
U-Net clearly reduces the false alarms generated by DenseNet_FCN (marked by
the red rectangle). By introducing attention modules, U-Net_CAM and U-Net_PAM
show improvements in accuracy and IoU. Their precision and recall do not improve
significantly. However, as shown in Fig. 7g-h, the classification results
of U-Net_CAM and U-Net_PAM are more refined, and the boundary between sea ice and
open water is smoother. Finally, the DAU-Net, which integrates both the CAM and the PAM,
obtains the highest accuracy, IoU, and precision (Table 2). Compared
with the original U-Net model, the accuracy, IoU, and precision of the DAU-Net
increase by 0.50%, 1.00%, and 1.14%, respectively. The precision and recall are in
balance. By comparing Fig. 7i and f, it can be found that the false alarms generated by
U-Net are reduced significantly, and the classification results of DAU-Net are more
refined. Fine-grained objects such as small floes, sinuous ice-water boundaries,
and ice channels are classified more smoothly by DAU-Net. Therefore, the CAM
and the PAM can improve the representative ability of the extracted features and thus
improve the classification of sea ice and open water.


4.4 Effectiveness of IA 


As the IA is ignored in existing DL-based models [16, 42], we design an experiment
to evaluate the effectiveness of employing the IA of SAR images as one input. Table 3
shows the experimental results. DAU-Net is the model with IA, and DAU-Net_NIA is
the model without IA. The other experimental settings are unchanged. The accuracy
and IoU of DAU-Net_NIA are lower than those of DAU-Net. Its precision is much
larger than its recall, which means that DAU-Net_NIA misses many sea ice pixels. As
shown in Fig. 8c, some sea ice pixels in the upper left part of the image are
misclassified as open water. Thus, the IA is essential for obtaining better classification
results.


Table 3 Evaluation results of using IA

Model         Accuracy (%)   IoU      Precision   Recall
DAU-Net_NIA   -              -        -           -
DAU-Net       94.39          0.8673   0.9355      0.9225


Fig. 8 a and b VV channel and VH channel of the testing set; c-f classification results of
DAU-Net_NIA, DAU-Net_VV, DAU-Net_VH, and DAU-Net, i.e., the models without IA, without VH,
without VV, and with all inputs, respectively
4.5 Effectiveness of Dual-Polarization Information


We design an experiment to evaluate the effectiveness of dual-polarization inputs.
DAU-Net uses the VV channel, VH channel, and IA as inputs. DAU-Net_VV uses the
VV channel and IA as inputs, and DAU-Net_VH uses the VH channel and IA as inputs.
The other experimental settings are unchanged. The results are shown in Table 4. The
four metrics of DAU-Net_VH are smaller than those of the other two models. As shown in
Fig. 8e, DAU-Net_VH misclassifies many sea ice pixels as open water, mainly in
the upper left part of the image. DAU-Net_VV performs better than DAU-Net_VH, but
it still misses some sea ice pixels in the middle part of the image (Fig. 8d). Finally, by
combining VV and VH as inputs, DAU-Net achieves the best performance. Thus, the
dual-polarization information of SAR images helps to obtain better classification
results.


Table 4 Evaluation results of using dual-polarization information

Model        Accuracy (%)   IoU      Precision   Recall
DAU-Net_VV   -              -        -           -
DAU-Net_VH   -              -        -           0.8742
DAU-Net      94.39          0.8673   0.9355      0.9225


4.6 Performances of Different ResNet-Based Encoders 


The encoder in DAU-Net is ResNet-34. We design an experiment to evaluate the
performance of two other ResNet-based encoders. DAU-Net_18 is the model
using ResNet-18 as the encoder, and DAU-Net_50 is the model using ResNet-50 as
the encoder. The other parts of these two models are the same as those of DAU-Net.
As shown in Table 5, the performances of the three models do not differ much.
DAU-Net with ResNet-34 as the encoder slightly outperforms the models with the other
two ResNet-based encoders. For our classification task, ResNet-34 is a more suitable
encoder than the other two ResNet models.


Table 5 Evaluation of different ResNet encoders

Model        Encoder     Accuracy (%)   IoU      Precision   Recall
DAU-Net_18   ResNet-18   -              -        -           -
DAU-Net_50   ResNet-50   93.72          0.8523   0.9292      0.9115
DAU-Net      ResNet-34   94.39          0.8673   0.9355      0.9225
5 Discussions


To validate the robustness of the proposed model, we employ the DAU-Net to classify
sea ice and open water in a series of SAR images of the Bering Strait and
compare the classification results with the sea ice products provided by NSIDC. As
DenseNet_FCN represents the existing DL-based classification models for sea ice,
we take its results as comparison targets. The image series consists of six images,
each of which is mosaicked from three Sentinel-1A images, for a total of 18
Sentinel-1A images. Their details are shown in Table 1. The image series
covers the process from freezing to melting in the Bering Strait, including a variety
of sea ice textures and sea surface conditions. As shown in Fig. 9a-f, sea ice partially
appeared in the Bering Strait on Dec 13, 2018, and it covered the entire region until
Mar 19, 2019. Then, on Mar 31, 2019, the sea ice started to melt, and by May 6, 2019,
most of it had receded. The most recent data (generally from the previous day) of the
1 km MASIE-NH products appear in the archive at approximately 10:00 p.m. (Greenwich
Mean Time, GMT). The 18 Sentinel-1A images of the Bering Strait were acquired around
06:00 p.m. (GMT). Due to the time difference, the date of the MASIE-NH products
we employed is one day later than the date of the Sentinel-1A images. The cell size
of the DAU-Net result is 30 m. The spatial resolutions of the two datasets differ too
much for a quantitative comparison of their evaluation metrics to be reasonable. Here, we
discuss the performance of DAU-Net through a visual comparison of the classification
results.

Figure 9g-l shows the classification results of DAU-Net, and Fig. 9m-r shows the
corresponding MASIE-NH products. Overall, the DAU-Net results are consistent with
the MASIE-NH products. The sea surface in Fig. 9a, d, and f is very rough and bright
and mixes with the sea ice, especially in the regions marked by red rectangles. As shown
in Fig. 9g, j, and l, the DAU-Net classifies the sea ice and open water well, which
demonstrates that the proposed model can deal with complex sea surfaces. There
are many water gaps, small sea ice floes, and sinuous ice-water boundaries in Fig. 9c
and f, which are finely classified by the DAU-Net, as shown in Fig. 9i and l. The
separate water channels in Fig. 9e are also successfully classified by DAU-Net, as
shown in Fig. 9k. Because the spatial resolution of the MASIE-NH products is 33.3 times
lower than that of the DAU-Net results, many fine-grained objects cannot be classified
in the MASIE-NH products. As shown in Fig. 9i, k, and l, the classification results
of DAU-Net are more consistent with the SAR images than the MASIE-NH products,
especially in the regions marked by the yellow rectangles in Fig. 9c, e, and f.
Taking the region marked by the yellow rectangle in Fig. 9f as an example, we show
the detailed comparisons between the classification results of DAU-Net and the 1 km
MASIE-NH products in Fig. 10. Our classification results show obvious advantages
over the MASIE-NH products in spatial resolution (Fig. 10b-d).

However, DAU-Net does not perform well in some regions. As marked by the
green rectangles in Fig. 9a, some sea ice pixels with dark textures are misclassified as
open water, and some open water pixels with extremely rough surfaces are misclassified
as sea ice. The misclassifications may be due to the lack of these two types of
samples in the training set. They mainly occur in the SAR image from
Dec 13, 2018, the early stage of sea ice formation in the Bering Strait, which contains
some very dark sea ice textures. These textures are rare during the freezing and melting
stages. In addition, extremely rough sea surfaces are also rare in the training set,
resulting in misclassifications. As shown in Fig. 9s-x, the results of DenseNet_FCN are
generally consistent with the MASIE-NH products. However, DenseNet_FCN performs worse
than DAU-Net, especially in the regions marked by red circles, where some rough sea
surface pixels are misclassified as sea ice pixels.

Fig. 9 Comparison between results of DAU-Net, MASIE-NH products, and results of
DenseNet_FCN for a time series (Dec 13, 2018 to May 6, 2019) in the Bering Strait. a-f SAR
images, VV channel. g-l Classification results of DAU-Net. m-r 1 km MASIE-NH products. s-x
Classification results of DenseNet_FCN. Marked regions: DAU-Net performs well in regions with
complex sea conditions (red rectangles); the results of DAU-Net are more consistent with the
SAR images (yellow rectangles); DAU-Net does not perform well (green rectangles);
DenseNet_FCN performs worse than DAU-Net (red circles)




Fig. 10 A detailed comparison between the results of DAU-Net and the MASIE-NH products in a
representative region marked in Fig. 9f. a The SAR image on May 6, 2019. b-d The detailed SAR
image, classification results of DAU-Net, and 1 km MASIE-NH products corresponding to the marked
region


In summary, by validating the applicability of DAU-Net on a series of SAR
images of the Bering Strait, we demonstrated that DAU-Net performs well in most
sea conditions. The proposed model is capable of dealing with various sea ice textures.
Owing to the advantages of SAR image resolution and model performance, the results of
DAU-Net are more refined than the MASIE-NH products. DAU-Net also outperforms the
existing DL-based sea ice classification model, DenseNet_FCN. However, DAU-Net
does not perform well on some unusual textures. To further improve the model's
applicability, we will collect more training samples to supplement the rare texture
types.
6 Conclusions


This study proposes a DAU-Net model to classify sea ice and open water in
SAR images. We combine ResNet-34 with U-Net to form the model backbone.
The SAR images are obtained from Sentinel-1A. The dual-polarized information and the
IA of the SAR images are used as the model inputs. We integrate the dual-attention
mechanism, PAM and CAM, into the original U-Net model to extract more characteristic
features, which helps to achieve more accurate classifications. We use 15
Sentinel-1A SAR images acquired near the Bering Sea to train the model. We evaluate
the model performance on one SAR image and compare DAU-Net with
typical DL-based ice classification models. Further, we use the trained model
to classify a series of SAR images of the Bering Strait that covers the process from
freezing to melting, and we compare the classification results of
DAU-Net with the 1 km MASIE-NH products of NSIDC. Experiments show that:
1) the dual-attention mechanism enhances the representative ability of the features and
helps DAU-Net outperform the original U-Net and typical existing DL-based ice
classification models, especially in the classification of fine-grained targets; 2) the
three-channel inputs, i.e., the dual-polarized information (VV and VH) and the IA,
contribute to high-accuracy classification; and 3) DAU-Net is capable of dealing with
complex sea state conditions from freezing to melting, showing good robustness and
applicability.

In the future, to address the misclassifications of unusual sea ice textures, we
will collect more training samples covering a wide range of space and time. We will
also explore the possibility of integrating few-shot learning to solve this
problem. In addition, multi-category classification models that discriminate MYI, sea
ice, and open water will be follow-up work.
1 Introduction


Harmful algal blooms (HABs), e.g., Yellow Sea green algae, are disastrous ecological
events in coastal oceans. During the blooming period, the rapid biomass increase
severely impacts coastal ecosystems and even affected the Olympic regatta games in
2008 [6, 13-15, 17, 20, 22, 23].

Satellite remote sensing is a suitable means for observing and analyzing green algae
(U. prolifera) because of its frequent data acquisition and broad coverage
[5, 7]. Existing studies mostly use passive optical-sensor images of 250-1,000 m
resolution, e.g., from the Moderate Resolution Imaging Spectroradiometer (MODIS).
Floating U. prolifera modulates the ocean color properties, so the algae appear as
prominent features on the sea surface in optical images [2, 8, 10, 21]. Active synthetic
aperture radar (SAR) images provide sea surface roughness with a resolution of tens
of meters. The floating algae on the sea surface behave like a volume-scattering
hard object, and the signal reflected from an algae patch is much stronger than that
backscattered from the surrounding water, so the patch appears as a brighter region
in SAR images. SAR has become another option for detecting algae because some
SAR images have become free and open, e.g., the European Space Agency (ESA)
Sentinel-1 and Chinese Gaofen-3 data. For optical-sensor images, biological index
methods, e.g., the NDVI (Normalized Difference Vegetation Index) and FAI (Floating
Algae Index), are commonly used [1, 5]. For SAR-sensor images, previous studies
usually use differences in grey level, roughness, or backscatter coefficient to identify
the target [4, 12]. However, these methods cannot effectively fuse the information from
optical and SAR images since the physical mechanisms by which optical and SAR sensors
detect U. prolifera are very different. Based on the algae’s characteristics in the
two sensors’ images, deep learning (DL) offers a possibility to perform data fusion
[12]. U. prolifera thalli have a hollow tubular structure. During blooms,
some parts of the algae body are exposed above the sea surface, while other parts are
submerged below it. An optical sensor can collect spectral information down to a
certain sea depth and thus effectively captures both the floating and underwater parts
of the algae [7, 11]. The SAR sensor captures only the floating part on the sea surface.
Thus, we can define the floating and submerged algae ratio (FS ratio), i.e., the part
detected by the SAR sensor divided by the part detected by the optical sensor. The
objectives of this research include 1) proposing a DL network to better detect
U. prolifera in optical and SAR images, and 2) using the defined FS ratio to represent
algae life stages.


2 Data and Methodology 


2.1 Satellite Images and Labels 


We collected geometrically and radiometrically corrected MODIS true-color imagery
(bands 1/4/3) with 250 m spatial resolution containing algae patches in the Yellow
Sea. These MODIS images were acquired under clear-sky conditions from 2008 to 2021.
Compared to the surrounding seawater, the U. prolifera algae show more prominent
green slick/patch features (Fig. 1a-d). Using the Labelme software [16], we labelled
sample image slices containing different algae shapes for DL algorithm development.
Finally, 1,055 pairs of labelled MODIS samples were obtained, and 680/292/83 pairs
were used as training/validation/testing sets.

We also collected Sentinel-1 Level-1 GRD (Ground Range Detected) dual-polarization
(VV, VH) Interferometric Wide swath mode images with 10 m spatial resolution
and 250 km swath, and Chinese GaoFen-3 SAR Fine Stripe Mode II (FSII) dual-polarization
(HH, HV) images with 10 m resolution and 100 km swath, acquired between 2015
and 2019. All SAR images were processed with speckle filtering and geometric,
radiometric, orthometric, and terrain corrections to improve image quality using the
Sentinel Application Platform (SNAP) 7.0 software. The algae patches appear as bright
spots/slicks in SAR images (Fig. 1e). We marked 4,071 pairs of labelled algae
samples; 2,086/895/1,090 pairs were used as training/validation/testing sets.


2.2 UNet-Based Algae Detection Network (AlgaeNet) 


We propose a DL-based model, AlgaeNet, to better detect the algae patches. Figure 2
shows the model’s system diagram, which is based on the U-Net framework [12, 16]. Optical
and SAR images are input to the DL model separately, and the corresponding detection
result of the optical (SAR) image produced by the improved model is Algae coverage-1 (2).
The model can then perform data fusion based on the two sensors’ detection
results, and the FS ratio can be estimated as Algae coverage-2/Algae coverage-1.
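The FS ratio estimate can be illustrated with a small sketch that converts two binary detection masks into areas and divides them. It assumes that "Algae coverage" means the detected area, computed here as pixel count times pixel area with the 250 m (MODIS) and 10 m (SAR) resolutions quoted above; the function names and the toy masks are hypothetical.

```python
import numpy as np

def algae_area_km2(mask, pixel_size_m):
    """Area covered by algae pixels (mask == 1), in square kilometres."""
    return float(np.sum(mask)) * pixel_size_m ** 2 / 1e6

def fs_ratio(sar_mask, optical_mask, sar_pixel_m=10.0, optical_pixel_m=250.0):
    """FS ratio = Algae coverage-2 (SAR) / Algae coverage-1 (optical)."""
    coverage_2 = algae_area_km2(sar_mask, sar_pixel_m)
    coverage_1 = algae_area_km2(optical_mask, optical_pixel_m)
    if coverage_1 == 0:
        raise ValueError("No algae detected in the optical image; FS ratio is undefined.")
    return coverage_2 / coverage_1

# Toy masks covering the same 500 m x 500 m scene (illustrative values only)
optical_mask = np.ones((2, 2), dtype=np.uint8)       # 4 MODIS pixels, all algae
sar_mask = np.zeros((50, 50), dtype=np.uint8)
sar_mask[:25, :25] = 1                                # 625 SAR pixels flagged as algae

ratio = fs_ratio(sar_mask, optical_mask)
```

In this toy scene the optical sensor sees 0.25 km² of algae and the SAR sensor sees 0.0625 km², so the ratio is 0.25, i.e., a quarter of the optically detected algae is floating at the surface.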
Fig. 1 U. prolifera algae blooms in the Yellow Sea and algae detection examples at the pixel level
during the green tide bloom period. a A randomly selected MODIS true-color image from June 25,
2008; b the manually marked ground truth; c the prediction of the classic U-Net model; d the
corresponding prediction of the AlgaeNet model. White dots are algae pixels, and black is the
background ocean. e An algae detection example based on the AlgaeNet model in
Sentinel-1 SAR images

Fig. 2 AlgaeNet model design. Algae coverage-1 (2) is based on MODIS (SAR) images


During the DL architecture design, particular attention should be paid to maintaining
the tradeoff between optimization and generalization of the network. Overfitting
of the algae detection model is usually prevented through three methods: dropout,
weight regularization, and batch normalization (BN). Of these three techniques, we
found that BN and weight regularization were beneficial for the network. BN normalizes
the inputs of a layer in the DL model to zero mean and unit variance [9]. The weight
initialization could cause a digression in the gradients, meaning that the gradients
have to compensate for outliers. BN regularizes the gradients by normalizing activations
throughout the network. It prevents small parameter changes from amplifying into larger
and suboptimal changes in activations and gradients. L2 weight regularization is also
added to each hidden layer. During optimization, L2 regularization adds penalty terms on
the model parameters or activation values of the hidden layer, which keeps the
parameters from growing too large and the network from becoming too complicated. These
penalty terms are included in the network’s final optimization objective.
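The two regularization techniques can be illustrated with a minimal NumPy sketch. It shows the training-time normalization step of BN over a batch and an L2 penalty added to a data loss; the function names, the regularization strength, and the toy inputs are assumptions for illustration, not the settings used for AlgaeNet.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) over the batch to zero mean and unit variance,
    then apply the scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def l2_penalty(weights, lam):
    """L2 weight penalty: lam times the sum of squared weights over all layers."""
    return lam * sum(float(np.sum(w ** 2)) for w in weights)

def total_loss(data_loss, weights, lam=1e-4):
    """Optimization objective = data loss + L2 penalty terms."""
    return data_loss + l2_penalty(weights, lam)

# Toy activations with a non-zero mean and a large spread (illustrative values only)
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm(x)

penalty = l2_penalty([np.array([3.0, 4.0])], lam=0.5)   # 0.5 * (9 + 16)
```

After normalization each column of `y` has a mean close to 0 and a variance close to 1, regardless of the scale of the incoming activations; the penalty grows quadratically with the weight magnitudes, which discourages large parameters.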
Table 1 Performance of AlgaeNet model

Input data            AI-Model        Output (%)
                                      Accuracy   Precision   Recall   F1 score   IoU
MODIS                 Classic U-Net   96.37      62.96       53.84    58.04      37.89
                      AlgaeNet        97.51      66.61       55.41    60.50      42.62
Sentinel-1/GF-3 SAR   AlgaeNet        99.83      95.35       92.04    93.67      88.09
                      Random Forest   99.39      72.95       87.96    79.96      66.60


2.3 Model’s Performance 


The performance evaluation of the AlgaeNet model covers algae detection in MODIS
and SAR images separately. For MODIS images, Table 1 shows that the AlgaeNet-MODIS
model performs better than the original U-Net model; the AlgaeNet-MODIS (U-Net)
model reached 97.51 (96.37)%, 66.61 (62.96)%, 55.41 (53.84)%,
60.50 (58.04)%, and 42.62 (37.89)% in the five commonly used indicators of accuracy,
precision, recall, F1 score, and mean intersection over union (IoU). For SAR images,
the AlgaeNet-SAR model reached 99.83, 95.35, 92.04, 93.67, and
88.09% in the five indicators, which are significantly better than those of
AlgaeNet-MODIS. Figure 1 also gives a visual presentation of the algae detection
performance during the U. prolifera blooming period. Finally, we further compared the
model with a Random Forest (RF) model. Table 1 shows that our model
performs significantly better than the RF model, which indicates the excellent
portability of the particular improvement strategy in the networks.
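As a quick consistency check on the reported indicators, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the precision and recall values quoted above. The short sketch below does this for the two AlgaeNet results; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall (in %) reported for AlgaeNet
f1_modis = f1_score(66.61, 55.41)   # MODIS images
f1_sar = f1_score(95.35, 92.04)     # Sentinel-1/GF-3 SAR images

print(f"AlgaeNet-MODIS F1: {f1_modis:.2f}%")
print(f"AlgaeNet-SAR   F1: {f1_sar:.2f}%")
```

Rounded to two decimals, both recomputed values agree with the reported F1 scores of 60.50% and 93.67%.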


3 Results and Discussion 


The AlgaeNet model was applied to MODIS and SAR images to examine the changes in
algae coverage in 2020 and 2021. Figure 3 shows that the maximum biological
coverage in 2021 is nearly four times that of 2020. This significant difference has
attracted widespread attention, and it is related to nutrients, sea surface temperature,
sea surface salinity, seaweed planting valve area and valve frame recovery time,
species competition, etc. [3, 18, 21].

We used the AlgaeNet model to process the collected MODIS and SAR images and
acquired twelve pairs of spatiotemporally matching MODIS and SAR images/slicks.
Figure 4 shows that the algae patches captured by the MODIS and SAR sensors have a
highly consistent spatial distribution pattern (Fig. 4b-c). In addition, we found
one interesting detail: for big algae patches/slicks with a high degree of aggregation,
the margin of the algae patches observed by the MODIS sensor is broader than that
observed by the SAR sensor. This is due to the unique floating mechanism of the
algae body. U. prolifera thalli have a hollow tubular structure and float
on the sea surface; some parts of the algae body are exposed above the sea surface,
while others are submerged below it. Therefore, optical sensors can collect
spectral information down to a certain sea depth and effectively capture the underwater
part of the algae [7, 11]. The SAR sensors, on the other hand, capture only the floating
part on the sea surface. Thus, we can estimate the floating and submerged algae ratio
(FS ratio) of U. prolifera algae.

Fig. 3 The detected maximum algae coverage in 2020 (a) and 2021 (b) in the Yellow Sea

As shown in Fig. 5, the FS ratio reflects the changes in the floating status of 
U. prolifera. Based on the algae distribution, coverage, and biomass results of the 
collected MODIS and SAR images from 2008-2021/2015-2019, the U. prolifera 
bloom, which originated in the Subei Shoal and drifted northward, passed through 
four phases: initiation, development, maintenance, and decline. At the various stages 
of the bloom, the floating U. prolifera underwent morphological changes. 
At the initiation phase in the Subei Shoal, a large proportion of the U. prolifera 
algae was submerged in seawater [19], and the algae biomass was low. Based on the two 
matching MODIS and SAR image pairs, the FS ratio of the algae body was less than 5% 
(Fig. 4). During the development phase, the biomass of U. prolifera increased rapidly. 
A large proportion of U. prolifera became floating owing to the optimal illumination 
and temperature; the FS ratio therefore increased quickly to 24.75%, and in some 
local areas even exceeded 40%. During the maintenance phase of the U. 


Fig. 4 Algae FS ratio estimation. a detected algae pixels in the MODIS and SAR images; b and 
c correspond to enlarged views of two randomly selected sub-areas 


prolifera bloom, the algae moved northward, and the biomass and FS ratio 
remained at a high, essentially unchanged level (~21.35%, shown as the dotted box). 
During the decline phase of the bloom, there were almost no algae near the Subei Shoal, 
and the FS ratio of algae patches in the Yellow Sea decreased rapidly to 14.33%. 
Therefore, over its entire life cycle, the FS ratio of U. prolifera followed a roughly 
parabolic course: it increased, held steady, and then decreased. The increase 
(initiation phase) and the decrease (decline phase) were rapid compared with the 
changes during the development and maintenance phases. 

Fig. 5 FS ratio: a life status indicator of U. prolifera algae 


4 Conclusions 


This chapter establishes an improved DL model for detecting U. prolifera algae in 
MODIS and SAR images. The model achieves high detection performance: an accuracy of 
97.51% and a mean IoU of 42.62% for MODIS images, and 99.83% and 88.09%, respectively, 
for SAR images. The detection results show that the maximum biological coverage in 2021 
was almost four times that of 2020, owing to various natural and anthropogenic factors. 
In addition, the FS ratio serves as a useful indicator of the life status of 
floating U. prolifera algae. 


1 Introduction 


Coastal zones are ecologically important and exceptionally dynamic. Monitoring these 
regions is essential for coastal environmental protection and development. The waterline, 
also called the shoreline or coastline in coastal zones, is defined as the line of contact 
between land and a water body. It plays a key role in analyzing land/water 
resources and in monitoring coastal erosion [3] and global sea-level rise. 

Optical remote sensing of waterlines is easily contaminated by clouds. Waterline 
extraction from synthetic aperture radar (SAR) imagery is becoming more common 
owing to the radar’s all-weather, day-and-night capability. However, distinguishing the 
waterline in SAR images is not as simple as it is for visible-band sensors. The 
return from wind-roughened and wave-modulated water can frequently equal or exceed the 
return from nearby land, resulting in inadequate contrast for unambiguous 
land-sea separation. This phenomenon is more evident in some tidal flat areas, where 
the return is also affected by the moisture of the sandy sediments [6]. In addition, the 
speckle noise generated by coherent signal scattering complicates waterline extraction 
from SAR images. 

Because the volume of remote sensing data has been growing exponentially and manual 
delineation is labor-intensive and subjective, several automatic or semi-automatic 
waterline extraction methods for SAR images have been proposed based on two 
conventional approaches: edge detection [10, 13, 16, 19, 28] and image segmentation 
[9, 14, 22, 24]. However, whichever approach they are based on, these methods 
require some degree of preprocessing and postprocessing to obtain an accurate extraction 
result from SAR images [9, 24]. 

Recently, deep convolutional neural networks (DCNNs) have been widely 
employed to extract information from remote sensing images [11]. Several 
machine-learning-based methods have been proposed for waterline or coastline extraction 
from SAR images, all of which show far better results than conventional edge detectors 
[1, 8, 29]. However, unlike regular land or ice regions, tidal flat areas show 
dramatic brightness changes in SAR images under different sea conditions. 

In this chapter, a modified U-Net is used to build a framework for automatic 
waterline extraction from Sentinel-1 SAR images of a large-scale tidal flat at Subei 
Bank in the Southern Yellow Sea. The extracted waterlines are then used to construct 
a series of digital elevation models (DEMs) for different years, which supports the 
analysis of tidal flat evolution using the waterline method. We first describe our 
study area, the unique palm-like Radial Sand Ridges along the Jiangsu coast, and the 
SAR imaging features of the various sandbanks under different sea conditions. Afterward, 
we introduce our input data and the DCNN-based method. Finally, after testing the 
trained model’s performance, we develop a processing chain for constructing the 
tidal flat DEM from the automatically extracted waterlines and an assimilative ocean 
tide model. 


2 Study Area and Data 


The Jiangsu coast is located in the western part of the South Yellow Sea, and its 
offshore area is characterized by palm-shaped radial sand ridges (RSRs). The RSRs 
consist of more than ten prominent submarine sand ridges and have a unique radial 
palm shape with the central apex near Jianggang. This giant system is well developed 
owing to active tidal processes and abundant sediment supply from river runoff 
[4]. It is 200 km long in the north-south direction and 90 km wide in the 
east-west direction, with water depths ranging from 0 to 25 m [27]. The complex 
hydrodynamic system [20, 30] makes the area’s topography changeable. As shown 
in Fig. 1b, several large-scale tidal flats are distributed in the study area. 
Compared with optical imaging systems, the active microwave sensor acquires data 
independently of daylight and cloud cover, ensuring continuous coverage of the study area. 
The Sentinel-1 mission comprises a constellation of two polar-orbiting satellites 
that perform C-band SAR imaging day and night, enabling them to acquire 
imagery regardless of the weather [26]. The two satellites, Sentinel-1A (launched on 
3 April 2014) and Sentinel-1B (launched on 25 April 2016), complement each other, 
giving a revisit time of six days or even less (in polar regions). With the support of 
Google Earth Engine [7], we collected 140 pre-processed Sentinel-1 SAR images acquired 
from 2015 to 2019 for the waterline extraction analysis in this chapter. The images are 
Ground Range Detected (GRD) products in IW (interferometric wide-swath) mode with dual 
polarization (VV and VH) and a spatial resolution of 10 m. Among the 140 images, 52 

Fig. 1 a Overview of the study area. The elevation data are from the ETOPO1 (NOAA National 
Geophysical Data Center, 2009). b Sentinel-1B SAR image of the study area at low tide, imaged at 
Greenwich Mean Time (GMT) 09:54, 26 November 2019 


acquired in 2019 are used for training and testing our DCNN model, and the remainder 
for constructing the DEM of the large-scale tidal flats in Subei Bank. 

Besides speckle noise, the accuracy and efficiency of automatic waterline extraction 
in the study area are hampered mainly by two other factors: rapid local brightness 
changes over seawater and over tidal flats. A SAR image is a two-dimensional map of 
radar backscatter, which reflects the roughness of the ocean surface. Therefore, 
processes that change the local roughness (such as winds, internal solitary waves, 
currents, underwater topography, oil spills, rainfall, and eddies) appear as bright or 
dark features in the image. According to Zhang et al. [31], the shallow-water topography 
in our study area, modulated by wind and tidal currents, is often captured by SAR. As 
shown in the northeast corner of Fig. 1b, three underwater sand ridges appear as narrow 
bright stripes (1 km wide) in this SAR image. The non-uniform SAR imaging of the sea 
surface is more evident in Fig. 2. The four sub-images were acquired under different sea 
conditions and differ considerably from one another over both seawater and the 
tidal flats. These unpredictable changes make the automatic extraction 
of waterlines for these large-scale tidal flats very difficult. 

Fig. 2 Four typical Sentinel-1 VV-polarized SAR image examples acquired at different tidal 
levels: a Sentinel-1B image acquired at GMT 09:54, 19 March 2019; b Sentinel-1A image acquired 
at GMT 09:55, 26 December 2019; c Sentinel-1A image acquired at GMT 09:54, 23 November 
2016; d Sentinel-1A image acquired at GMT 09:55, 21 July 2017 


3 Methodology 


3.1 U-Net 


DCNNs have been successfully applied in oceanography to extract key information from 
remote sensing images. Recently, Li et al. [11] established an improved 
U-Net to efficiently and automatically extract the signatures of different ocean 
processes in optical and radar images. The U-Net [23] is a modified fully convolutional 
network [15] initially developed for biomedical image segmentation. It extends 
the fully convolutional network to work with fewer training 

Fig. 3 The U-Net architecture specially tailored for this chapter 


images to yield more precise segmentation. The network consists of a contracting 
path and an expansive path, giving it a U-shaped architecture. As shown in Fig. 3, 
the left contracting path is a typical convolutional network that consists of the repeated 
application of convolutions, each followed by a rectified linear unit (ReLU) 
and a max-pooling operation. During contraction, the spatial information 
is reduced while the feature information is increased. The right expansive path 
combines a sequence of up-convolutions and concatenations with high-resolution 
features from the contracting path. One crucial modification in U-Net is that the 
upsampling part has many feature channels, allowing the network to propagate 
context information to higher-resolution layers. Consequently, the expansive path is 
more or less symmetric to the contracting path. The main idea is to supplement a usual 
contracting network with successive layers in which upsampling operators replace pooling 
operations, so that these layers increase the resolution of the output. A successive 
convolutional layer can then learn to assemble a precise output based on this information. 
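
The interplay of the two paths can be made concrete by tracing the tensor shapes through the network. The sketch below is illustrative only: it assumes a 256 × 256 input, "same" padding, four pooling levels, and 64 base channels as in the original U-Net [23]; the configuration actually used in this chapter (Fig. 3) may differ, and the function name is hypothetical.

```python
def unet_shapes(size=256, base_channels=64, depth=4):
    """Trace (height, width, channels) through a U-Net that uses 'same' padding."""
    encoder = []
    channels = base_channels
    for _ in range(depth):
        encoder.append((size, size, channels))  # output of the conv block at this level
        size //= 2                               # 2 x 2 max pooling halves the resolution
        channels *= 2                            # the next conv block doubles the channels
    bottleneck = (size, size, channels)

    decoder = []
    for _, _, skip_channels in reversed(encoder):
        size *= 2                                # up-convolution doubles the resolution
        channels //= 2                           # and halves the number of channels
        concatenated = (size, size, channels + skip_channels)  # skip connection
        decoder.append((concatenated, (size, size, channels)))
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes()
for level, shape in enumerate(encoder):
    print("encoder level", level, shape)
print("bottleneck", bottleneck)
for concatenated, output in decoder:
    print("decoder concat", concatenated, "->", output)
```

The concatenated shapes show where the high-resolution features of the contracting path re-enter the expansive path.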

As shown in Fig. 3, the U-Net’s last layer is a 1 × 1 convolution with sigmoid 
activation. The loss function of the original U-Net is the cross-entropy. 
However, in waterline extraction the samples are highly unbalanced, i.e., 
background samples far outnumber waterline samples (which account for 
less than 1% of the pixels in a whole SAR image). Motivated by Lin et al. [12], we adopt 
the α-balanced cross-entropy in this task. 
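
In the α-balanced cross-entropy of Lin et al. [12], the loss for a pixel with predicted waterline probability p is -α log(p) if the pixel belongs to the waterline and -(1 - α) log(1 - p) otherwise. The following pure-Python sketch illustrates this weighting with α = 0.99, the value used for training in this chapter. The function name, the clipping constant eps, and the toy inputs are illustrative assumptions; this is not the chapter's Keras implementation.

```python
import math

def alpha_balanced_bce(y_true, y_pred, alpha=0.99, eps=1e-7):
    """Mean alpha-balanced binary cross-entropy over a flat list of pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -alpha * math.log(p)                # waterline pixel
        else:
            total += -(1.0 - alpha) * math.log(1.0 - p)  # background pixel
    return total / len(y_true)

# One waterline pixel among three background pixels, all predicted at 0.5
loss = alpha_balanced_bce([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])

# Cost of a confident miss relative to an equally confident false alarm
miss = alpha_balanced_bce([1], [0.1])
false_alarm = alpha_balanced_bce([0], [0.9])
print(loss, miss / false_alarm)
```

With α = 0.99, missing a waterline pixel is penalized about 99 times more heavily than an equally confident false alarm, which counteracts the dominance of background pixels.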


3.2 Data Preparation 


The original spatial resolution of the dual-polarized Sentinel-1 SAR imagery 
downloaded from Google Earth Engine is 10 m. After statistical analysis, we found that 
the boundary lines between land and water are more apparent in the VV-polarized images 
than in the VH images. To save computing resources and training 
time, we use only the VV-polarized images and downsample them to a resolution 
of 50 m. A full SAR image of the study area is then 2229 pixels high and 
2005 pixels wide. We further crop the images and their corresponding ground truth 
into sub-images of 256 × 256 pixels to keep memory consumption low during 
training (edges narrower than 256 pixels are padded with black). In the end, we 
obtained a total of 3024 image pairs for training the U-Net. In addition, before 
training the network, we apply data augmentation to compensate for the limited 
number of images in the training dataset. Data augmentation increases the amount 
of data by adding slightly modified copies of existing data, produced for example by 
random contrast and brightness changes, image rotation/cropping, and noise 
injection. It may help the network learn more tidal flat waterline features in 
SAR imagery with variable brightness and shapes. 
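
The reported number of sub-images is consistent with tiling each 2229 × 2005 image into a 9 × 8 grid of 256 × 256 patches, with zero padding at the edges, for the 42 training images. The sketch below is a plain-Python illustration of this arithmetic and of the padding; the helper names are hypothetical, and the row-major tiling order is an assumption.

```python
import math

TILE = 256

def tile_grid(height, width, tile=TILE):
    """Number of tile rows and columns needed to cover the image."""
    return math.ceil(height / tile), math.ceil(width / tile)

def crop_to_tiles(image, tile=TILE, fill=0):
    """Split a 2-D list into tile x tile patches, padding the edges with `fill`."""
    height, width = len(image), len(image[0])
    rows, cols = tile_grid(height, width, tile)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            patch = []
            for y in range(r * tile, (r + 1) * tile):
                if y < height:
                    line = image[y][c * tile:(c + 1) * tile]
                    line = line + [fill] * (tile - len(line))  # pad the right edge
                else:
                    line = [fill] * tile                       # pad the bottom edge
                patch.append(line)
            tiles.append(patch)
    return tiles

rows, cols = tile_grid(2229, 2005)
tiles_per_image = rows * cols          # 9 * 8
total_pairs = tiles_per_image * 42     # 42 training images
print(rows, cols, tiles_per_image, total_pairs)
```

The same function applied to the ground truth masks keeps image and label patches aligned.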

Ground truth labels are necessary for training a machine learning classifier. 
Since there is neither a corresponding waterline product nor a method that can 
automatically extract these edges, we draw the waterlines manually to obtain the ground 
truth for the 52 Sentinel-1 SAR images acquired in 2019 (an example of the result 
is shown as the output in Fig. 3). In practice, we use a stylus and a touch screen to 
trace the position of the waterlines accurately. We first randomly select about one fifth 
of the 52 image pairs, that is, ten pairs, as the testing set for examining how 
accurately the trained model extracts waterlines from independent data. The remaining 
42 SAR images with their labels are used for U-Net model training. 


3.3 Training 


The sub-images cropped from the 42 Sentinel-1 SAR images are divided into 80% for 
training and 20% for validation. The network is trained and tested 
with the Keras/TensorFlow framework (on an NVIDIA Tesla 
V100 GPU, 32 GB). As mentioned above, we adopt the α-balanced cross-entropy as 
the loss function (α is set to 0.99) and the classification accuracy as the performance 
metric. The batch size is set to 16, and the number of epochs is 4000. 
The classification accuracy on the 20% validation images is 94.45% after 
nearly ten hours of training. 


4 Results 


4.1 Model Performance 


The binary classification accuracy is estimated by calculating the precision and recall 
of the automatically extracted waterlines relative to the manually drawn ones. The mean 
precision and recall of the ten testing images are 0.92 and 0.77, respectively (see 
Table 1 for details). 

Table 1 Details of ten testing Sentinel-1 SAR images 

Image ID  Satellite    Imaging time (MM/DD HH:mm, GMT)  Tidal level (m)  Precision  Recall 
1         Sentinel-1A  02/17 09:55                      -3.32            0.96       0.76 
2         Sentinel-1A  04/30 09:55                      -1.39            0.93       0.78 
3         Sentinel-1B  05/30 09:54                      -1.40            0.93       0.75 
4         Sentinel-1A  06/05 09:55                      -1.28            0.91       0.69 
5         Sentinel-1A  06/17 09:55                      -2.40            0.92       0.83 
6         Sentinel-1A  06/29 09:55                      -1.52            0.89       0.78 
7         Sentinel-1A  07/11 09:55                       0.97            0.93       0.84 
8         Sentinel-1B  07/29 09:54                      -1.95            0.94       0.79 
9         Sentinel-1A  09/09 09:55                      -0.47            0.95       0.81 
10        Sentinel-1A  12/26 09:55                      -3.02            0.93       0.74 
Mean                                                                     0.92       0.77 
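
Pixel-wise precision and recall of the kind reported in Table 1 can be computed from a predicted and a manually drawn waterline mask as sketched below. This is a minimal illustration that assumes strict pixel-to-pixel matching; the chapter does not state whether a spatial tolerance was used, and the toy masks are made up.

```python
def precision_recall(pred_mask, true_mask):
    """Pixel-wise precision and recall of a binary waterline mask."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # correctly extracted waterline pixel
            elif p and not t:
                fp += 1      # falsely detected pixel
            elif t and not p:
                fn += 1      # missed waterline pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

true_mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
pred_mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
precision, recall = precision_recall(pred_mask, true_mask)
print(precision, recall)
```

A high precision with a lower recall, as in Table 1, means that most extracted pixels lie on the true waterline while some waterline segments are missed.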


Four of the ten testing results, obtained under different sea conditions, are shown 
in Fig. 4. Lines in three colors compare the model results with the ground truth. 
Yellow marks the waterlines accurately extracted by our DCNN-based model, red marks 
the waterlines missed by the model, and blue marks falsely detected lines. 
The fluctuation of the tides causes drastic changes in the shape and distribution 
of the waterlines. Figures 4a-c show the results under three typical tidal levels 
(high, medium, and low), which can also be judged from the exposed area of the 
tidal flats. Interestingly, Fig. 4c captures a small amount of Enteromorpha, shown 
as the small bright spot in the northern sea. As the yellow lines in Fig. 4 show, most of 
the extraction results from the DCNN-based model correspond well to the 
manually annotated ground truth waterlines. 


4.2 Automatic Topographic Mapping of Tidal Flats 


Knowledge of a waterline’s orientation, position, and outline is essential for 
autonomous navigation at sea, verification of a coastal platform’s attitude and position, 
the geolocation of ships, geographic mapping, etc. It also has a specific application in 
constructing a digital elevation model (DEM) of an intertidal zone by the waterline 
method [16]. This method was first introduced by Mason et al. [17]. The waterline can 
be regarded as a quasi-contour line of the topography. The method has proved to 
be one of the best for DEM generation of tidal flats, providing an excellent trade-off 
between accuracy and cost-effectiveness [18, 24, 32]. 

Fig. 4 Four examples of the ten testing images overlaid with their corresponding trained model 
extraction results and ground truth waterlines: a Sentinel-1A image acquired at GMT 09:55, 11 July 
2019; b Sentinel-1B at 09:54, 29 July 2019; c Sentinel-1A at 09:55, 17 June 2019; d Sentinel-1A 
at 09:55, 05 June 2019. The mean precision and recall of the ten testing images are 0.92 and 0.77, 
respectively 


Automatic Waterline Extraction of Large-Scale Tidal Flats from SAR Images ... 295 


Fig. 5 The flowchart of the method for automatic topographic mapping of the tidal flats developed 
(flowchart elements: Ocean Tidal Prediction Model; DCNN-Based Model; Automatic Extraction; 
Multi-temporal Tidal Flats Waterlines) 

This study further attempts to establish a method for automatic topographic mapping 
of tidal flats based on the waterline method and the DCNN-based waterline 
extraction model for SAR images. The flowchart of this method is shown in Fig. 5. 
The elevation generation process can be divided into four steps: 


1. Automatically extracting the waterlines from a series of Sentinel-1 SAR images 
acquired at different tidal levels with the trained DCNN-based model; 

2. Discretizing the lines into points and estimating their Lon/Lat positions from the 
original images; 

3. Evaluating the water level of each point with the ocean tidal prediction model at 
the SAR imaging time; 

4. Finally, interpolating the resulting set of quasi-contour lines to a gridded DEM. 
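
Steps 2 to 4 above can be sketched in a few lines. The example is a simplified illustration under several assumptions: the waterlines are given as polylines in projected coordinates (metres) rather than Lon/Lat, each waterline carries a single tidal level instead of one value per point from the tide model, and inverse-distance weighting stands in for whatever interpolation scheme was actually used. All names and numbers are hypothetical.

```python
import math

def discretize(line, spacing):
    """Resample a polyline [(x, y), ...] into points roughly `spacing` apart."""
    points = [line[0]]
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        n = max(1, int(round(length / spacing)))
        for i in range(1, n + 1):
            f = i / n
            points.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    return points

def idw(x, y, samples, power=2.0):
    """Inverse-distance-weighted elevation at (x, y) from (xs, ys, z) samples."""
    num = den = 0.0
    for xs, ys, z in samples:
        d = math.hypot(x - xs, y - ys)
        if d == 0.0:
            return z  # the grid node coincides with a waterline point
        w = 1.0 / d ** power
        num += w * z
        den += w
    return num / den

# Two made-up waterlines with the tidal level (m) at their imaging times
waterlines = [
    ([(0.0, 0.0), (0.0, 100.0)], -1.0),
    ([(100.0, 0.0), (100.0, 100.0)], 1.0),
]

samples = []
for line, tide_level in waterlines:
    for px, py in discretize(line, spacing=50.0):   # step 2
        samples.append((px, py, tide_level))        # step 3

grid = (0.0, 50.0, 100.0)
dem = [[idw(x, y, samples) for x in grid] for y in grid]  # step 4
for row in dem:
    print([round(z, 3) for z in row])
```

Each waterline acts as a contour at the tidal level of its acquisition, so more waterlines at more tidal levels give a denser set of contours and a better-constrained DEM.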


As shown in the previous subsection, the DCNN-based model performs well, with 
little or no postprocessing required to obtain accurate waterlines, even for large-scale 
tidal flats like the Subei Bank. In addition, our method is highly efficient, 
taking two seconds on average per SAR image (2229 × 2005 pixels, 
on an NVIDIA Tesla V100 32 GB GPU). According to Zhang et al. [31], the 
TPXO tide model [5] reproduces the tidal phase well in our study area but 
systematically underestimates the tidal amplitude. Therefore, 
in-situ water level data from two tide gauge stations in our study area were used 
to calibrate the tide model (see [31] for details). The corrected TPXO tide model, 
run with the Tidal Model Driver software, is employed as the ocean tidal prediction model 
to evaluate the tidal level for each point of each waterline in this method. 

We first used the waterlines of 2019 to verify the accuracy of the waterline method 
in measuring tidal flat elevation in our study area. We eliminated five scenes with 
wind speeds greater than 10 m/s, in which the waterlines are shifted considerably from 
the position determined by tidal fluctuation alone [25]. The waterlines from the 
remaining 47 SAR images were then assigned tidal level values using the corrected 
ocean tidal model (see Fig. 6). 

Fig. 6 Assembled tidal level evaluated waterlines extracted manually from all SAR images acquired in 2019 

Finally, as shown in Fig. 7, these points/lines were interpolated to obtain a gridded 
DEM of the large-scale tidal flats in our study area. One transect of measured 
topographic data, acquired by an in-situ survey in May 2019, was used 
to test the accuracy of the derived DEM. The mean absolute error along this transect 
is about 0.3 m (see Fig. 8). 

Among the steps of the waterline method, the most time-consuming is extracting 
the waterlines, especially from SAR images. With the support of the DCNN-based 
automatic waterline extraction model, the efficiency of the method 
can be significantly improved. We took the generation of the tidal flats’ DEM for 2018 
as an example. A total of 29 pre-processed Sentinel-1 SAR images acquired throughout the 
year were collected with Google Earth Engine and used as inputs to the DCNN-based 
model to obtain the geolocation of the waterlines quickly. The final gridded DEM 
for 2018, obtained after the same subsequent steps of tidal level evaluation and 
spatial interpolation, is shown in Fig. 9a. In addition, interannual topographic changes 
can be analyzed by subtracting the two waterline-derived DEMs. As shown in Fig. 9b, 
the topography of these large-scale tidal flats changed significantly between the two 
years under the action of strong tidal currents [2]. The erosion-deposition balance showed a net 
deposition of 0.12 km³ from 2018 to 2019. This implies that the presented methods 
are also suitable for rapidly monitoring the morphological and sedimentary changes 
of large-scale intertidal areas. 
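
An erosion-deposition balance of this kind can be obtained by differencing two co-registered DEM grids and summing the elevation changes multiplied by the cell area. The sketch below assumes that the balance is expressed as a volume, that both DEMs share the same grid, and that cells without data are marked with None; the function name and the toy values are hypothetical and unrelated to the reported result.

```python
def net_volume_change(dem_new, dem_old, cell_size_m):
    """Return (deposition, erosion, net) in km^3 from two gridded DEMs in metres."""
    cell_area = cell_size_m ** 2          # m^2
    deposition = erosion = 0.0            # m^3
    for row_new, row_old in zip(dem_new, dem_old):
        for z_new, z_old in zip(row_new, row_old):
            if z_new is None or z_old is None:
                continue                  # skip cells missing in either DEM
            dz = z_new - z_old
            if dz > 0:
                deposition += dz * cell_area
            elif dz < 0:
                erosion += -dz * cell_area
    m3_to_km3 = 1e-9
    return (deposition * m3_to_km3,
            erosion * m3_to_km3,
            (deposition - erosion) * m3_to_km3)

# Toy 2 x 2 DEMs on a 1 km grid (made-up elevations in metres)
dem_2018 = [[0.0, 0.5], [1.0, None]]
dem_2019 = [[0.5, 0.25], [1.0, 0.3]]
deposition, erosion, net = net_volume_change(dem_2019, dem_2018, cell_size_m=1000.0)
print(deposition, erosion, net)
```

A positive net value indicates net deposition between the two years.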
Fig. 7 The derived DEM result of Subei Bank for 2019 using the waterline method (overlaid on the 
Sentinel-1A SAR image acquired at GMT 09:55, 20 November 2019) 


5 Discussions 


Because consistent, sufficient intensity contrast between land and water regions is 
frequently lacking, and because the waterline is hard to distinguish from other object 
boundaries, waterline extraction is difficult for most general-purpose edge detectors 
or image segmentation techniques, especially for radar images of intertidal 
areas. Previous studies used edge detection methods in which a thresholding step 
was necessary at some point under relatively complex imaging conditions (such as 

Fig. 9 a The derived DEM result of Subei Bank for 2018 using the waterline method based on 
the waterlines automatically extracted by our DCNN model; b Difference map of the tidal flat DEMs 
between 2019 and 2018 (both overlaid on the Sentinel-1A SAR image acquired at GMT 09:55, 20 
November 2019) 


the methods developed in [16], [21], and [9]). In addition, with the unprecedented 
amount of data containing waterline information now available, automatic extraction 
methods should be prioritized. The DCNN-based method developed in this study 
performed well for automatic waterline extraction from SAR imagery of large-scale 
tidal flats under changeable imaging conditions. 

With the support of big data platforms such as Google Earth Engine and the ocean 
tidal prediction model, we developed a workflow based on the waterline method that can 
quickly produce a relatively accurate DEM of tidal flats after extracting multi-temporal 
waterlines from SAR images acquired at different tidal levels. This technique provides an 
efficient means for the rapid analysis of large-scale tidal flat topography evolution, 
which is of great significance for applying SAR images to monitoring coastal terrains. 


6 Conclusions 


This chapter proposes a DCNN-based method to extract waterlines automatically 
from SAR images. Our approach achieves both a relatively high extraction accuracy for the 
waterlines in complicated large-scale tidal flats (the mean precision and recall are 0.92 
and 0.77, respectively) and high efficiency (several seconds per image). 
This chapter also presents the first attempt at intertidal DEM generation for the Subei 
Bank using the waterline method by analyzing high-spatial-resolution SAR images. 
The DEM results show that, in general, there is good agreement between the derived 
elevation and in-situ topographic data, implying that the waterline method based on 
SAR images can be used for large-scale tidal flats such as the Subei Bank area. 
Furthermore, based on the waterline extraction model and the waterline method, 
we developed a novel workflow for automatic topographic mapping of large-scale 
tidal flats, which has excellent potential for rapid analysis of intertidal topography 
evolution. 


by Deep Learning 


Yibin Ren, Xiaofeng Li, and Huan Xu 


1 Introduction 


Ship detection is very important to marine transportation [5]. Spaceborne synthetic 
aperture radar (SAR) has been one of the most critical data sources for ship detection 
because it can penetrate clouds and track objects in all kinds of weather [28]. 
In marine applications, ship recognition from SAR imagery has long been a hotspot 
[4, 9, 19]. With the advancement of image analysis technology, SAR images can be 
used to derive more detailed ship information [8]. The size of a ship provides basic 
information for ship classification [11]. Estimating intricate geometric parameters 
is also part of SAR image interpretation. A method for extracting ship size that is 
both efficient and precise would bring a new concept to SAR image interpretation. 
Ships, in general, are metallic objects that reflect the electromagnetic radiation of 
the SAR sensor significantly more strongly than the surrounding ocean. On SAR images, 
a ship can be identified as a bright target with high backscattering intensity, i.e., 
high normalized radar cross-section (NRCS) values. The minimum bounding rectangle (MBR) 
is a geometric characteristic of the ship’s NRCS signature that offers a preliminary size 
for determining the ship’s ground size. However, the ship’s superstructure, sea-ship 
interaction, and imaging conditions all affect the NRCS (Li et al. [11]). 
These factors lead to a large gap between the initial size and the ground size. Figure 1 
shows several examples of ship signatures on SAR images, the size of the MBR, 

Fig. 1 Examples of ships on SAR images. a/d/g/j ship signature on SAR image; b/e/h/k labeled 
MBR of ship signature and the MBR’s size; c/f/i/l the ship’s ground size 


and the ground size of the ship. The MBR is labeled by visual interpretation. The 
difference between the MBRs and the ground size appears to be clear. As a result, 
precisely extracting ship size from SAR images is difficult. 


2 Traditional Methods 


2.1 Typical Procedure of Traditional Methods 


The majority of classic techniques for extracting ship size from SAR images have 
three stages (Fig. 2): (1) binarization, (2) initial size extraction, and (3) accurate 
size estimation. Binarization divides the pixels in the SAR image into two groups: 
ship signatures and non-ship signatures. The binary result is then converted into an 
MBR in the second stage. The length and width of the created MBR are taken as 
the ship’s initial size. Finally, a regression model estimates 
the accurate ship size from the initial size and other relevant factors such as the 
maximum and minimum NRCS of the ship signature. Statistical/machine learning 

Fig. 2 The procedure of the traditional algorithm for ship size extraction from SAR images 


(ML) methods, such as linear regression, non-linear regression, and kernel-based 
methods, are commonly used in regression models. 
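
The first two stages of this procedure can be sketched as follows. The example is deliberately simplified and rests on several assumptions: a fixed threshold replaces CFAR or other binarization schemes, the MBR is reduced to an axis-aligned bounding box (a true MBR may be rotated), and the chip values, threshold, and 10 m pixel spacing are made up. The third stage would fit a regression model to labeled ships and is not shown.

```python
def binarize(chip, threshold):
    """Stage 1: label pixels brighter than `threshold` as ship signature."""
    return [[1 if value > threshold else 0 for value in row] for row in chip]

def bounding_box_size(mask, pixel_spacing_m):
    """Stage 2: initial (length, width) in metres from an axis-aligned box."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None  # no ship signature found
    extent_y = (rows[-1] - rows[0] + 1) * pixel_spacing_m
    extent_x = (cols[-1] - cols[0] + 1) * pixel_spacing_m
    return max(extent_x, extent_y), min(extent_x, extent_y)

# Toy NRCS chip (made-up linear values): a bright 2 x 3 target on a dark sea
chip = [
    [0.01, 0.02, 0.01, 0.02, 0.01],
    [0.02, 0.90, 0.80, 0.70, 0.01],
    [0.01, 0.85, 0.95, 0.75, 0.02],
    [0.02, 0.01, 0.02, 0.01, 0.01],
]
mask = binarize(chip, threshold=0.5)
initial_size = bounding_box_size(mask, pixel_spacing_m=10)
print(initial_size)
```

The resulting initial size, together with features such as the maximum and minimum NRCS of the signature, would then be passed to the regression model.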


2.2 Representative Traditional Methods 


Stasolla and Greidanus [26] used the Constant False Alarm Rate (CFAR) method to binarize 
the SAR image. CFAR is a common method [21, 29, 30] for separating ship signatures 
from the background. To extract the ship’s MBR, they then used mathematical 
morphology to refine the signature. They adopted the MBR’s length and 
width as the ship’s final length and width, without a third step. They tested their model 
on 127 available ship samples from Sentinel-1 images. The mean absolute error 
(MAE) of length is 30 m (relative error 16%), and the MAE of width is 11 m (relative 
error 37%). In 2018, Li et al. [11] estimated the sizes of the ships in OpenSARShip [7]. 
The ship signature was obtained using a threshold-based approach. They used an 
image segmentation procedure to refine the ship signature and determine the initial 
ship size. Finally, a gradient boosting model was employed to estimate the accurate 
ship size. According to their experiments, the MAEs of the length and width are 8.80 m 
(relative error 4.66%) and 2.17 m (relative error 7.01%), respectively. 


2.3 Issue to be Further Addressed 


The accuracy of ship size extraction has improved over the years. However, the standard 
three-step procedure is quite complicated. Binarization and initial size extraction 
need advanced image processing to meet the requirements of the subsequent estimation 
stage [11]. The third stage is similarly difficult [20]. The errors introduced at each 
stage accumulate and eventually compromise the accuracy of the final size extraction. 
In the era of big data, it is possible to build new approaches that increase the 
accuracy and efficiency of ship size extraction. 

Deep learning (DL), a cutting-edge AI technology, has made great achievements 
in computer vision [10]. A typical DL model is made up of multiple neural network layers. 
It accepts raw data as input and automatically learns the essential characteristics 
needed to perform classification or prediction [25]. This process is called end-to-end 
learning. Compared with traditional machine learning, DL simplifies feature engineering 
and is well suited to modeling massive data and complex interactions. DL has been 
successfully employed in oceanography, geography, and remote 
sensing in recent years [12, 13, 22, 24, 31, 32]. DL offers novel approaches to 
the problem of estimating the size of a ship. 


3 Deep Learning Method 


3.1 Ship Detection Based on DL 


A deep convolutional neural network (CNN) is a type of deep neural network (DNN) made up 
of convolutional layers. CNN-based models have been very successful in target detection. 
Researchers have proposed CNN-based ship detection models, such as models based on the 
faster region-based convolutional network (Faster-RCNN) [23], the single-shot multibox 
detector (SSD) [15], and you only look once (YOLO) [2]. Orientation is an 
important characteristic of a ship. Several researchers have therefore proposed rotatable 
bounding boxes (RBBs) to replace the usual non-rotating bounding boxes, such as DRBox [14] 
and DRBox-v2 [1]. 

For the ship detection task, DL has become the first choice. DL-based models 
achieve end-to-end detection with higher accuracy and robustness than conventional 
models. However, deep learning has hardly been applied to ship size extraction. 
Therefore, developing an end-to-end DL model is necessary. 


3.2 SSENet: A Deep Learning Model to Extract Ship Size 
from SAR Images 


SSENet is a new end-to-end DL model that replaces the previous three-step process 
for extracting ship size from SAR data. The model uses DRBox-v2 to create the 
ship’s RBB from the SAR image and a DNN-based regression model to estimate the 
accurate ship size. The DNN-based regression model is proposed using a hybrid input 
and a loss function termed mean scaled square error (MSSE), which considerably 
increases ship size estimation accuracy. 


3.2.1 Overall Structure of SSENet 


SSENet’s overall structure consists of three stages (Fig. 3): (1) RBB generation; (2) 
accurate ship size estimation; (3) MSSE loss calculation and overall model optimization. 
The SAR chip is used as input in the first stage, which uses a deep CNN 
model called DRBox-v2 to automatically detect the ship’s RBBs. The RBB with 

Fig. 3 Structure of SSENet (panels: 1. Generating rotated bounding box; 2. Estimating ship size 
based on a DNN model; 3. Calculating MSSE loss) 


the highest confidence is chosen as the initinal RBB. A DNN model is used in the 
second stage to estimate ship size. The DNN model takes two types of data as inputs: 
(1) the initial length, width, and orientation angle, and (2) the SSD feature map. The 
accuracte ship’s length and width are generated using the DNN model. 


3.2.2 Generating RBB for the Ship 


DRBox-v2 is used to generate the RBBs of the ship [1]. Its input is a 300 × 300
pixel SAR image, and its output is a series of RBBs. DRBox-v2 contains two
sub-modules: a feature extraction module and an output module. The feature
extraction module extracts abstract features; here, VGG16 is employed for this
purpose. VGG16 consists of five feature extraction units. The first unit is made up
of two stacked CNN layers, while each of the others is made up of a max-pooling
layer followed by two stacked CNN layers. Each unit produces a three-dimensional
feature map as its output, so five feature maps, F1, F2, ..., F5, are generated. The
numbers of channels in F1 to F5 are 64, 128, 256, 512, and 512. The pooling kernel
is 2 × 2, so each max-pooling layer halves the spatial size of a feature map. As the
input SAR image is 300 × 300 pixels, the spatial sizes of F1 to F5 are 300 × 300,
150 × 150, 75 × 75, 38 × 38, and 19 × 19 pixels.
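
The spatial sizes listed above follow from repeated halving, with odd sizes rounded up (75 / 2 = 37.5 becomes 38). The following minimal sketch reproduces them. The function name is illustrative, and the round-up rule is an assumption inferred from the listed sizes rather than a statement about the DRBox-v2 code.

```python
import math

def feature_map_sizes(input_size=300, n_units=5):
    """Spatial size of the feature map after each feature extraction unit.

    The first unit has no pooling; every later unit starts with a 2 x 2
    max-pooling layer. Rounding up is assumed so that 75 -> 38.
    """
    sizes = [input_size]
    for _ in range(n_units - 1):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

for index, size in enumerate(feature_map_sizes(), start=1):
    print(f"F{index}: {size} x {size}")
```

With the default arguments the last entry is 19, the spatial size of the feature map that is later passed to the regression model.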

The output module generates the output maps O_F by convolving the feature maps,
Fig. 2b. There are two outputs for one SAR image: the confidence of being a ship
and the location offsets relative to the prior RBBs. A softmax function activates
O_F to obtain the confidence output, and a sigmoid function activates O_F to obtain
the location offsets. Three feature maps (F2, F3, and F4) are fused to generate O_F;
a feature pyramid network (FPN) is used to combine the different feature maps. The
cross-entropy loss and the smooth L1 loss [15] are used as the confidence loss and
the location loss for DRBox-v2.

After this first stage, the candidate RBBs of the ship are collected, providing
initial references for the subsequent accurate size estimation.


3.2.3 Estimating Ship Size Based on a DNN Model 


The inputs of the DNN model have two parts, as shown in Fig. 3c. The first part is
the initial ship size and orientation angle, which are determined from the best RBB
and give primary, direct information for accurate ship size regression. DRBox-v2
generates a series of ship RBBs, and the RBB with the highest confidence value is
chosen as the best RBB. The initial ship size is the length and width of the best
RBB, and the orientation angle of the best RBB is taken as the ship’s orientation
angle, as shown in Fig. 4. The orientation affects the ship signature in the SAR
image [7, 11]. As the orientation does not distinguish between the bow and the
stern of a ship, we transform the range of the angle to (−90°, 90°].
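
Because bow and stern are not distinguished, two angles that differ by 180° describe the same ship axis. A minimal sketch of folding an arbitrary angle into (−90°, 90°] is given below; the function name is hypothetical and the chapter does not specify how the transformation is implemented.

```python
def normalize_orientation(angle_deg):
    """Fold an angle in degrees into the half-open interval (-90, 90]."""
    folded = angle_deg % 180.0   # now in [0, 180)
    if folded > 90.0:
        folded -= 180.0          # now in (-90, 90]
    return folded

for angle in (135.0, -90.0, 180.0, -45.0):
    print(angle, "->", normalize_orientation(angle))
```

Note that −90° maps to 90°, so the open lower bound and closed upper bound of the interval are respected.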

The other part of the inputs is the feature map derived from the input SAR image.
Under typical environmental conditions, the ship’s signature in the SAR image
reflects the sea clutter and also indicates whether the ship is moving or
stationary. During the SAR integration time, a moving target is frequently found in
several resolution cells. The resulting dispersion of backscattered energy causes
smearing and brightness loss in the SAR image. A moving ship’s signature also
shows an azimuth displacement. The SAR system receives the Doppler signal from the
scatterer in the azimuth direction. The azimuth position of a stationary ship is
identical to the azimuth position of the SAR platform. For a moving ship, on the
other hand, the Doppler shift has an extra component, which shifts the ship
signature in azimuth. The environmental conditions during satellite imaging, such
as wind fronts, ocean waves, and rain cells, also alter the ship’s signature in the
SAR image. Under typical conditions, the sea-ship interaction produces a complicated
ship motion in the real world and a polarimetric scattering signature with a wide
range of polarimetric scattering processes [14, 16, 17]. Reference [11] demonstrated
the relationship between the status of the ship, its surroundings, and the ship’s
size. The abstract feature map derived from the input SAR image contains the factors
stated above. Therefore, the feature map F5 in Fig. 3b is employed as the other
part of the input.

Fig. 4 Illustration of the ship orientation. a Coordinate system; b An example of a ship chip

Fig. 5 Transforming the feature map F5 into inputs. a The F5 feature map. b Compressing F5
in the channel dimension to obtain F5m. c Compressing F5m in the spatial dimension to
obtain F6. d Flattening F6 into a one-dimensional input vector

F5 is a three-dimensional feature map with 512 channels of 19 × 19 pixels. If it
were flattened directly, the input vector would contain 184,832 (512 × 19 × 19)
elements, which would make the fully connected DNN regression model difficult to
train. It is therefore necessary to transform F5 to reduce its dimension.

As shown in Fig. 5a, b, we transform F5 with a CNN layer of 1 × 1 × N convolutional
kernels, obtaining F5m. Compared with F5, the number of channels of F5m is reduced
from 512 to N (Fig. 5b); that is, F5 is compressed in the channel dimension. Then,
max-pooling of size S is performed on F5m, and a new feature map F6 is obtained
(Fig. 5c). The spatial size of F6 is [19/S]. The values of N and S are determined
by experiments. Finally, F6 is flattened into a one-dimensional feature vector.
The flattened vector is concatenated with the initial width, length, and orientation
to form the inputs of the DNN model (Fig. 3c).
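
The compression steps can be sketched in plain Python to make the dimension bookkeeping concrete. Everything below is illustrative: the function names are hypothetical, the values N = 8 and S = 4 are made up (the chapter only says they are set by experiment), the 1 × 1 convolution omits a bias term, and rounding the pooled size up is an assumption.

```python
import math

def conv1x1(pixel, kernels):
    """Apply N 1 x 1 kernels to one pixel: a linear map across its channels."""
    return [sum(w * x for w, x in zip(kernel, pixel)) for kernel in kernels]

def max_pool(channel, size):
    """Non-overlapping size x size max-pooling of one square 2-D channel."""
    n = len(channel)
    return [
        [
            max(channel[r][c]
                for r in range(i, min(i + size, n))
                for c in range(j, min(j + size, n)))
            for j in range(0, n, size)
        ]
        for i in range(0, n, size)
    ]

def dnn_input_length(n_channels, pool_size, spatial=19, n_scalars=3):
    """Length of the flattened F6 plus the width, length, and orientation."""
    pooled = math.ceil(spatial / pool_size)  # assumed rounding
    return n_channels * pooled * pooled + n_scalars

print(512 * 19 * 19)           # elements in F5 before compression
print(dnn_input_length(8, 4))  # hypothetical N = 8, S = 4
```

Under these made-up settings the input shrinks from 184,832 elements to a few hundred, which is the point of the two compression steps.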

As shown in Fig. 3c, three hidden NN layers with 256 neurons each perform the
regression. The number of hidden NN layers and the number of neurons were
determined by a parameter-tuning experiment. The rectified linear unit (ReLU) is
the activation function of each hidden layer. Two neurons are stacked on the last
hidden NN layer to form the output layer. A sigmoid function is applied to the
output layer to map the estimated values to the range 0 to 1, giving the estimated
width and the estimated length (Fig. 3c).
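
A forward pass through such a regression head can be sketched without any deep learning library. This is only an illustration of the layer structure described above: the weights are random rather than trained, the input length of 203 is a hypothetical value, the output order (width, then length) is assumed, and all function names are invented for the example.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer with one weight row per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def regression_head(x, layers):
    """Three ReLU hidden layers followed by a two-neuron sigmoid output."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(x, weights, biases))

rng = random.Random(0)
n_inputs = 203  # hypothetical: flattened F6 plus three scalar inputs
sizes = [n_inputs, 256, 256, 256, 2]
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
features = [rng.random() for _ in range(n_inputs)]
scaled_width, scaled_length = regression_head(features, layers)
print(scaled_width, scaled_length)  # both values lie strictly between 0 and 1
```

Because the sigmoid bounds the outputs, the ground-truth sizes have to be scaled to the same range for training, and the predictions rescaled afterwards.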


3.2.4 Calculating MSSE Loss and Optimizing SSENet 


The MSSE loss function is used in the DNN regression model. For most regression
problems, the mean square error (MSE) is a commonly used loss function. The
definition of MSE is shown in Eq. (1), where $y_i$ is the ground truth, $\hat{y}_i$ is the
predicted value, and $N$ is the number of values to be predicted. The loss value
calculated by MSE does not depend on the magnitude of the ground truth. Assume that a
ship’s true length and width are 100 and 50 m, respectively, and that the predicted
length and width are 80 and 30 m, respectively. Both the length and the width MSE
values are then 400. Because the model is optimized based on loss values, the length
and width losses contribute equally to the model’s optimization. In practice, a
ship’s length is much greater than its width, and in most cases the length is of more
concern than the width. To increase the accuracy of the length estimate, we want the
length loss to contribute more to the model’s optimization than the width loss.


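$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 \tag{1}$$

$$\mathrm{MSSE} = \frac{1}{N}\sum_{i=1}^{N} y_i\left(\hat{y}_i - y_i\right)^2 \tag{2}$$

$$\mathrm{Size}_{Loss} = \mathrm{MSSE}_L + \mathrm{MSSE}_W \tag{3}$$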


The MSSE loss function solves this issue. MSSE incorporates the ground truth of the
ship length and width into the traditional MSE: the ground truth is used as a
dynamic parameter that scales the square error. The definition of MSSE is shown in
Eq. (2), where $y_i$ is the ground truth, $\hat{y}_i$ is the predicted value, and $N$ is the
number of samples. The MSSE length and width losses in the example are 40,000 and
20,000, respectively, so the loss in length is substantially greater than the loss
in width. As a result, the penalty on length errors is increased during the training
phase, and the optimization procedure is more conducive to length estimation. Based
on Eq. (2), the loss of length $\mathrm{MSSE}_L$ and the loss of width $\mathrm{MSSE}_W$ are calculated.
The size loss ($\mathrm{Size}_{Loss}$) is the sum of $\mathrm{MSSE}_L$ and $\mathrm{MSSE}_W$, Eq. (3).
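
The worked example can be checked numerically. The sketch below reads MSSE as the squared error weighted by its ground truth, which is the reading that reproduces the 40,000 and 20,000 quoted above; the function names are illustrative, and the values are in meters for readability even though the model itself works with outputs scaled to the range 0 to 1.

```python
def mse(y_true, y_pred):
    """Mean square error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msse(y_true, y_pred):
    """Mean scaled square error: each squared error is weighted by its ground truth."""
    return sum(t * (p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

length_true, length_pred = [100.0], [80.0]
width_true, width_pred = [50.0], [30.0]

print(mse(length_true, length_pred), mse(width_true, width_pred))    # 400.0 400.0
print(msse(length_true, length_pred), msse(width_true, width_pred))  # 40000.0 20000.0

size_loss = msse(length_true, length_pred) + msse(width_true, width_pred)
print(size_loss)  # 60000.0
```

The same 20 m error therefore weighs twice as much for the length as for the width, which is what steers the optimization toward the length.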

Besides $\mathrm{Size}_{Loss}$, the confidence loss ($\mathrm{Conf}_{Loss}$) and the location loss
($\mathrm{Loc}_{Loss}$) are two other losses, calculated in the first stage (Fig. 3b).
$\mathrm{Conf}_{Loss}$ is the cross-entropy loss, and $\mathrm{Loc}_{Loss}$ is the smooth L1 loss [1, 23].
Their definitions are as follows:


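$$\mathrm{Conf}_{Loss} = -\sum_{i=1}^{N}\left[c_i \log \hat{c}_i + (1 - c_i)\log\left(1 - \hat{c}_i\right)\right] \tag{4}$$

$$\mathrm{Loc}_{Loss} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{smooth}_{L1}(x_i), \qquad
\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{5}$$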


where $N$ is the number of predicted targets, $c_i$ is the ground-truth confidence of a
sample, $\hat{c}_i$ is the predicted confidence of a sample, and $x_i$ is the element-wise
difference between the ground-truth RBB and the predicted RBB. The three losses,
$\mathrm{Size}_{Loss}$, $\mathrm{Conf}_{Loss}$, and $\mathrm{Loc}_{Loss}$, are added to form the final loss, which
optimizes SSENet as a whole.
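
The two detection losses and their combination with the size loss can be sketched as follows. This is a plain-Python illustration, not the DRBox-v2 or SSENet code: the cross-entropy is written with the conventional leading minus sign so that it is minimized, the small `eps` guard against log(0) is an addition of this sketch, and the function names are hypothetical.

```python
import math

def confidence_loss(c_true, c_pred, eps=1e-12):
    """Binary cross-entropy summed over samples."""
    return -sum(
        c * math.log(max(p, eps)) + (1 - c) * math.log(max(1 - p, eps))
        for c, p in zip(c_true, c_pred)
    )

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def location_loss(diffs):
    """Mean smooth L1 over element-wise differences between two RBBs."""
    return sum(smooth_l1(x) for x in diffs) / len(diffs)

def total_loss(size_loss, conf_loss, loc_loss):
    """Unweighted sum of the three terms, as described in the text."""
    return size_loss + conf_loss + loc_loss

print(smooth_l1(0.5), smooth_l1(2.0))  # 0.125 1.5
print(location_loss([0.5, 2.0]))       # 0.8125
```

The quadratic branch keeps gradients small for nearly correct boxes, while the linear branch limits the influence of large offsets.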


3.3 Experiments on SSENet


3.3.1 Experiments Data 


The OpenSARShip dataset (http://opensar.sjtu.edu.cn/) is a Sentinel-1 ship
interpretation dataset that includes 11,346 SAR ship chips and the corresponding
automatic identification system (AIS) messages. The AIS provides the ground-truth
size of each ship. The SAR images are ground range detected (GRD) products acquired
in the IW mode of Sentinel-1. The spatial resolution of the SAR image is around
20 m, with a pixel spacing of 10 m. Radiometric calibration and terrain correction
are performed with SNAP 3.0. Each SAR chip contains one ship and two channels, which
store the pixel amplitude values for the VH (vertical emitting and horizontal
receiving) and VV (vertical emitting and vertical receiving) polarizations. The
experimental set for SSENet includes 1,890 samples in VV polarization. Figure 6
shows the distributions of the ground-truth length and width of the ships. The
length ranges from 28 to 399 m, and the width ranges from 6 to 65 m. Each SAR chip
is 300 × 300 pixels in size. We scale the values of the SAR images to [0, 255]. The
training set consists of 1,500 SAR chips chosen at random, and the remaining 390
chips are used for testing.

The ground truths for the experimental set include two parts: the ground-truth ship
size and the RBB of each ship. The ground-truth size is obtained from OpenSARShip.
The RBB of each ship is labeled manually with a Matlab tool shared with DRBox-v2.
DRBox-v2 is trained to generate accurate RBBs based on the labeled RBBs.

Fig. 6 The range of length and width of the testing set


3.3.2 Experiments Setting


A workstation with one GeForce RTX 2070 8 GB GPU is used in the experiment. The
model is implemented in Python 3.6 with the TensorFlow deep learning package. For
training, the batch size is six and the initial learning rate is 0.0002. The
learning rate is halved every 5,000 training epochs. The training procedure stops
when $\mathrm{Size}_{Loss}$ < 0.001, $\mathrm{Loc}_{Loss}$ < 0.005, and the composite loss < 0.01.
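
The learning-rate schedule and stopping rule can be written down in a few lines. The sketch below assumes a staircase schedule (the rate changes only at multiples of 5,000) and uses hypothetical function names; it is independent of TensorFlow and is not taken from the authors' training script.

```python
def learning_rate(epoch, initial=0.0002, halve_every=5000):
    """Step schedule: the rate is halved every `halve_every` training epochs."""
    return initial * 0.5 ** (epoch // halve_every)

def should_stop(size_loss, loc_loss, composite_loss):
    """Stopping rule using the three thresholds quoted in the text."""
    return size_loss < 0.001 and loc_loss < 0.005 and composite_loss < 0.01

for epoch in (0, 4999, 5000, 10000):
    print(epoch, learning_rate(epoch))
```

All three conditions must hold at once, so a run with a small size loss but a location loss above 0.005 keeps training.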

The mean absolute error (MAE) and the mean absolute percentage error (MAPE) are
employed as metrics. MAE is a typical absolute error, and MAPE is a widely used
relative error. Assuming $y_i$ is the ground truth, $\hat{y}_i$ is the estimated value, and
$N$ is the number of samples, the definitions of MAE and MAPE are as follows:


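$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \tag{6}$$

$$\mathrm{MAPE}\,(\%) = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \tag{7}$$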
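
Both metrics are straightforward to compute. The following sketch uses made-up lengths purely for illustration; they are not taken from the OpenSARShip experiments.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the inputs."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

lengths_true = [100.0, 200.0]  # made-up ground-truth lengths in meters
lengths_pred = [75.0, 250.0]   # made-up estimates

print(mae(lengths_true, lengths_pred))   # 37.5
print(mape(lengths_true, lengths_pred))  # 25.0
```

MAE reports the error in meters, whereas MAPE normalizes each error by the true size, so the two ships above contribute equally to MAPE although their absolute errors differ.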


3.3.3 Performance of SSENet 


The hyper-parameters of SSENet are determined by parameter tuning, and a
well-trained model is selected for evaluation. The 390 samples of the testing set
are fed into the well-trained SSENet. The outputs are the scaled lengths and widths
estimated by the model, which are then rescaled to their original units.

The estimated ship sizes are shown in Fig. 7a, b. The length and width MAEs are
7.88 and 2.23 m, respectively; both are below 0.8 of the 10 m pixel spacing. The
MAPEs of the estimated length and width are 5.53% and 8.93%, respectively, and the
R² scores are 0.9773 and 0.9093. This indicates that the estimated ship length and
width are quite close to the ground-truth values. The R² score of the width is
smaller than that of the length, which means that the width is more difficult to
estimate than the length. Two factors contribute to this phenomenon. First, a ship’s
width is far smaller than its length, and the width of the ship’s signature in the
SAR image is more ambiguous than the length [26], which causes random errors in the
width of the labeled RBB. Second, the MSSE loss function makes the model fit the
length better.
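
Two of the figures above can be made concrete with a short sketch: the MAE expressed in pixels, using the 10 m pixel spacing stated for the dataset, and the R² score. The chapter does not define R², so the standard coefficient of determination is assumed here; the function name is illustrative.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

PIXEL_SPACING_M = 10.0  # pixel spacing of the SAR images
for name, mae_m in (("length", 7.88), ("width", 2.23)):
    print(name, "MAE in pixels:", mae_m / PIXEL_SPACING_M)
```

A score of 1 means a perfect fit, and a score of 0 means the estimates are no better than always predicting the mean size.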

We plot the relationship between the size of the labeled RBB and the ground-truth
size of the ship in Fig. 7c, d. The labeled RBB is regarded as the RBB that best
fits the ship’s signature under visual interpretation. As shown in Fig. 7c, the MAE
of length is nearly 40 m, and the MAE of width is more than 50 m. The gap between
the size of the labeled RBB and the ground-truth size is large. By adding the
regression model, SSENet pushes the MAEs under 8 m. Therefore, the proposed
DNN-based regression model is necessary and effective. Figure 8 shows some examples
of SSENet’s results. The outputs for one sample include the detected RBB, the
confidence score of being a ship, and the estimated ship size. For most ship
samples, the estimated sizes are consistent with the ground-truth sizes.

Fig. 7 Relationships between the ground size, SSENet’s estimated size, and the size of the
labeled RBB. a and b The relationships between the ground size and SSENet’s size. c and d
The relationships between the ground size and the labeled RBB’s size


3.3.4 Effectiveness of the Inputs 


The effectiveness of the inputs of the DNN regression model is tested, and the
results are shown in Table 1. Three models with different inputs are compared. The
inputs of SSENet1 are the initial ship size only, without the feature map F6. The
inputs of SSENet2 are the initial ship size and F6. SSENet3 adds the initial
orientation as a further input.

Fig. 8 Some examples of SSENet results. The outputs include the detected RBB, the
confidence score of being a ship, and the estimated ship size




Table 1 Model performance with different inputs

Models     Inputs of DNN           MAE (m)          MAPE (%)
                                   Length  Width    Length  Width
SSENet1    X_l, X_w                10.01   2.65     6.87    9.89
SSENet2    X_l, X_w, F6            8.14    2.27     5.82    9.22
SSENet3    X_l, X_w, cos θ, F6     7.88    2.23     5.53    8.93


The results are displayed in Table 1. SSENet1 obtains the largest MAE and MAPE
among the three models. By adding F6, SSENet2 reduces the MAE of the length by about
2 m compared with SSENet1. This finding illustrates that the feature map of a SAR
image is an important input for estimating ship size: adding the feature map as an
input improves the accuracy of size estimation. Finally, by explicitly including
the ship’s initial orientation as another input, the estimation errors are reduced
further (the length MAE falls from 8.14 to 7.88 m). Therefore, each element of the
inputs of SSENet contributes to the final size estimation. Figure 8 shows several
results of SSENet; the red and green rectangles are the labeled and detected RBBs,
respectively. The estimated confidence score of being a ship and the size estimated
by SSENet are also displayed.

Table 2 Performance with MSE or MSSE loss

Models         Loss function    MAE (m)          MAPE (%)
                                Length  Width    Length  Width
SSENet_MSE     MSE              8.85    2.20     5.99    8.81
SSENet_MSSE    MSSE             7.88    2.23     5.53    8.93


3.3.5 Effectiveness of MSSE Loss 


An experiment is conducted to test the effectiveness of the new loss function,
MSSE. SSENet_MSE is the model with the MSE loss, and SSENet_MSSE is the model with
the MSSE loss. The other parts of the two models are the same.

The results are shown in Table 2. The length MAE of SSENet_MSSE is nearly 1 m less
than that of SSENet_MSE, a reduction of 11%. For the width, SSENet_MSSE performs
slightly worse than SSENet_MSE. The reason is that MSSE emphasizes the larger loss
and drives the model to focus on length rather than width. The difference between
the two width MAEs, however, is only a few centimeters, so this drawback does not
outweigh the benefit of MSSE. As a result, our MSSE loss is helpful, particularly
for estimating the ship’s length.


4 Discussions 


4.1 ML versus DL 


SSENet’s regression model is a DNN model. We choose three typical ML models,
Gradient Boosting Regression (GBR) [6], Support Vector Regression (SVR) [3], and
Linear Regression (LR) [18], and compare their performance with it. GBR and SVR
have been applied to ship size extraction [8, 11], and LR is a baseline model [27].
Because these three ML models are not NN-based, they cannot be combined with the
SSD to create an end-to-end model, and the SAR images cannot be fed into them. The
inputs for these three models are the initial ship size and orientation of the
labeled RBB. The parameters of GBR, SVR, and LR are tuned, and the estimation
results with the best metrics are recorded. The DNN model is the regression model
used by SSENet.

The results are shown in Table 3. The result of SSENet is in the last row. GBR
achieves the lowest MAEs among the four models that use only the RBB as input (LR,
SVR, GBR, and DNN). GBR is an ensemble learning model with good performance in the
three-stage procedure [26]. However, GBR is unable to extract features from SAR
images automatically, and it cannot be combined with a DL-based ship detection
model, such as DRBox-v2, to create an integrated ship size extraction model. Using
GBR presupposes that the SAR image is binarized accurately and that the initial RBB
is well extracted by traditional methods. As stated in Sect. 2, the traditional
method faces big challenges. In practice, GBR is not an end-to-end model that takes
the SAR image as input and outputs the ship size.

Table 3 MAE and MAPE of different models

Models   Inputs   MAE (m)          MAPE (%)
                  Length  Width    Length  Width
LR       RBB      10.60   2.86     6.97    10.47
SVR      RBB      10.16   2.48     6.61    9.50
GBR      RBB      9.69    2.39     6.93    9.27
DNN      RBB      10.79   2.83     7.27    10.22
SSENet   Images   7.88    2.23     5.53    8.93

The error of the DNN model is large when only the RBB is used as input. However, a
DNN model can be combined with any deep learning model based on CNN or NN layers to
extract size from the SAR image from beginning to end. In contrast to traditional
techniques, the DL model optimizes all parameters globally. The DNN regression
model can also use the feature maps extracted by the DL model to increase the
accuracy of the estimated ship size. As shown in Table 3, SSENet reduces the MAE of
length by nearly 2 m compared with GBR, a reduction of about 18.68%. Therefore,
compared with traditional methods, the ship size extraction model based on deep
learning is more practical.
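
The percentage reductions quoted for the length MAE can be verified from the table values with a short calculation. The helper name below is illustrative.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Length MAEs in meters, taken from Tables 2 and 3.
print(round(relative_reduction(9.69, 7.88), 2))  # GBR -> SSENet: 18.68
print(round(relative_reduction(8.85, 7.88), 2))  # MSE -> MSSE loss: 10.96
```

The second value rounds to the 11% reported for the MSSE loss.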


4.2 Sources of Error


This section delves into the details of estimation errors and attempts to determine what 
causes large inaccuracies. The ship’s direction and transit speed are two elements 
that need to be investigated, according to previous research [10, 26]. 


4.2.1 Ship Orientation 


The estimation errors with respect to the ship’s orientation angle are shown in
Fig. 9: Fig. 9a, b show the results for the length, and Fig. 9c, d show the results
for the width. The MAEs vary with the ship orientation. Large MAEs occur when the
orientation angle is close to 0° (0° is the azimuth direction), in the range
(−45°, 45°]. There are two reasons for this phenomenon. The first is ship motion.
When the ship moves in a direction close to the azimuth direction, the azimuth
component of its speed is large, so the ship signature appears smeared, which
increases the estimation error. The other reason is the unequal resolution during
imaging, 5 m × 20 m in the range and azimuth directions, respectively. The low
resolution in azimuth enlarges the errors [26].

Fig. 9 The trend of errors with respect to the orientation angle. a Trend of length’s MAE; b Trend
of length’s MAPE; c Trend of width’s MAE; d Trend of width’s MAPE


As shown in Fig. 9, when the initial orientation angle (cos θ) is added to the
DNN model, the errors are reduced. This finding also supports using the initial
orientation angle as an input.


4.2.2 Ship Speed 


Figure 10 shows the errors with respect to the ship’s speed. Because the SAR images
in OpenSARShip are mostly from ports, around 83% of the ships are stationary.
Figure 10a shows that the MAEs are small in the range of (0, 1) kn (1 kn = 1.852
km/h). As the ship speed increases, the MAE fluctuates slightly. When the speed is
greater than 15 kn (27.780 km/h), the MAEs increase markedly, to 19.04 and 4.71 m
for length and width, respectively. These two values are far greater than those of
the other speed intervals. The ship’s speed cannot be derived from the SAR image
signature. Therefore, it is difficult to refine the estimated sizes of ships by
supplying the ship’s speed as an additional input.
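
An error-versus-speed analysis of this kind amounts to grouping the absolute errors by speed interval. The sketch below shows one way to do it together with the knot conversion; the bin edges and the sample data are hypothetical (the text only names the (0, 1) kn and above-15 kn intervals), as are the function names.

```python
KMH_PER_KNOT = 1.852

def knots_to_kmh(knots):
    return knots * KMH_PER_KNOT

def mae_by_speed(speeds_kn, abs_errors, edges=(0, 1, 5, 10, 15)):
    """Mean absolute error per speed bin, keyed by the bin's lower edge.

    Each bin covers [edges[i], edges[i + 1]); the last bin is open-ended.
    """
    bins = {}
    for speed, err in zip(speeds_kn, abs_errors):
        lower = max(e for e in edges if e <= speed)
        bins.setdefault(lower, []).append(err)
    return {lower: sum(v) / len(v) for lower, v in sorted(bins.items())}

speeds = [0.2, 0.5, 16.0]   # made-up AIS speeds in knots
errors = [4.0, 6.0, 19.0]   # made-up absolute length errors in meters
print(mae_by_speed(speeds, errors))  # {0: 5.0, 15: 19.0}
print(round(knots_to_kmh(15), 2))    # 27.78
```

Since most ships in the dataset are stationary, the high-speed bins contain few samples, which should be kept in mind when interpreting their MAEs.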


4.2.3 Ship Size


Figure 11 shows the absolute error (AE) of each estimated size against the
ground-truth size. The AE of an estimated size is the absolute value of the
difference between the predicted value and the true value. As shown in Fig. 11a, b,
there is no obvious relationship between AE and ground-truth size. Therefore, the
ship size is not a source of errors.


5 Conclusions


SSENet, a DL-based model for extracting ship size from SAR data, is proposed in
this chapter. SSENet is made up of an SSD-based model and a DNN-based regression
model. The DNN model is fed the initial ship size and orientation angle derived
from the RBB, as well as the high-level features extracted from the input SAR
image. SSENet is trained and validated on the OpenSARShip dataset. Experiments show
that: (1) SSENet can extract ship size directly from SAR images with an MAE of less
than 0.8 pixels; (2) the new MSSE loss reduces the MAE of the length by nearly 1 m
compared with the MSE loss; (3) SSENet shows a clear advantage over the GBR model;
(4) SSENet exhibits robustness over four separate data sets.


1 Overview


1.1 Backgrounds 


Deep-sea organisms are those living below the ocean belt, and they can be divided
into three categories according to their mode of life: plankton, swimming
organisms, and benthos. Deep-sea biological resources are an essential part of the
marine ecosystem and play a vital role in its formation, maintenance, and
development. They are also the foundation of marine ranch construction and aquatic
development [34]. Problems such as the establishment of deep-sea protection areas,
the sustainable utilization of resources, and the maintenance of vulnerable marine
ecosystems based on species diversity have become hot spots in global deep-sea
research. Research on the distribution and diversity of deep-sea organisms helps to
improve human understanding of the ecosystem and plays a vital role in the
maintenance of the marine ecosystem. Because the deep sea is dark all year round
(sunlight can hardly penetrate it), with high salinity, considerable pressure, and
low water temperature, the number of species is relatively small, whereas the
number of individuals is large in some densely populated areas. Therefore, it is
crucial to address these practical problems with modern technology.

Species discovery and identification are crucial ways to explore deep-sea
biodiversity. To better protect marine ecology, we can monitor the health status
and biodiversity of the benthic ecosystem by analyzing the species, quantity, and
growth of its organisms. Traditional methods of marine biological identification
are based on morphology and molecular genetics, and sometimes even require advanced
DNA sequencing technology supported by electron microscopy. Although these methods
are accurate, two main problems remain for marine species classification. On the
one hand, training professional taxonomy experts for marine species costs a large
amount of human and financial resources, and manual identification is inefficient.
On the other hand, the special ocean environment is unsuitable for in-situ
detection with molecular and electron microscopy methods during scientific
research, and ex-situ detection can lead to biological inactivation and the death
of specimens. To solve these problems, target detection technology based on deep
neural networks has been applied to marine species identification and quantitative
analysis.

Underwater target recognition is difficult because of the complex marine imaging
environment, the poor penetration of sunlight, the high salinity, the high
similarity of some detected targets, and the uneven distribution of biological
density [26]. Considering these problems, this chapter explores the static counting
of dense marine biological communities and an algorithm for automatic, real-time
dynamic detection and counting of marine benthos. This work is significant in
helping marine biologists identify marine species and evaluate population density,
in improving the operational performance of autonomous underwater robots, and in
promoting underwater operations and the development and ecological protection of
marine resources.

Seamounts are relatively isolated conical peaks or groups of peaks in the various 
oceans and are also an essential part of the marine environmental system. Seamounts 
rise from the seafloor but do not protrude from sea level. There are an estimated 30,000 
seamounts worldwide, but only a few have been studied. However, seamounts have 
become one of the most popular systems in deep-sea research in recent years because 
of their unique topographic and hydrological features and their unique ecosystems, 
rich biodiversity, and excellent resource value. 

Recently, China has successfully carried out a series of seamount explorations,
represented by the ‘Jiaolong’ manned HOV and the ‘FaXian’ ROV, and has obtained a
large amount of first-hand submarine image data and samples in the South China Sea
and the Western and Central Pacific. This work not only significantly improves the
country’s deep-sea detection capability but also provides data support for the
automatic detection of benthos on seamounts. Section 4 uses deep learning
technology to identify and detect the giant benthos on seamounts.


1.2 Related Works 


Because of the influence of the medium, the propagation distance of light and radio
waves in seawater is very limited. In contrast, sound waves propagate much better
in water and cover a wider sea area. However, the acoustic signal propagates along
different paths because of reflection, refraction, and other phenomena, so
underwater target recognition based on sonar echoes suffers from heavy interference
and low accuracy. Target recognition from sonar images, by comparison, offers high
resolution and real-time performance. Therefore, research on underwater target
recognition and detection has long focused on target detection in sonar images.

Traditional sonar image detection algorithms extract features from sonar images and
then classify and locate the target. Extracting sufficiently informative features
is the key to detecting the underwater target. Researchers have therefore proposed
a series of hand-designed feature extraction methods, such as the Scale Invariant
Feature Transform (SIFT) [24, 25] and the Histogram of Oriented Gradients (HOG)
[5]. Once the features are extracted, they are recognized by algorithms such as
morphology, fuzzy clustering, and Markov random fields [29]. These manual feature
extraction, classification, and detection methods recognize targets well in
specific application scenarios. However, they have poor scalability and low
generalization ability, because different features need to be designed for
different problems. Therefore, the application value of these algorithms is
limited.

With the development of underwater high-definition imaging platforms such as ROVs
(Remotely Operated Vehicles) and AUVs (Autonomous Underwater Vehicles), the data of
close-range targets collected by optical imaging equipment can be analyzed by
computer vision algorithms without sonar images. The feature information of the
target is retained and used more fully, and the accuracy and efficiency of target
detection are greatly improved. Following this trend, the Fish4Knowledge project
[7] collected 115 TB of underwater high-definition image and video data and
proposed many methods to detect fish in underwater video for assessing fish
biodiversity [39]. In fish detection, the SIFT [1, 28, 37] and SC (Shape Context)
[35] algorithms have been widely used to calculate marker features, but the author
of [47] concludes that the HOG algorithm is better than both. Marcos et al. [27]
used a Normalized Chromaticity Coordinates (NCC) histogram to extract color
features and the Local Binary Pattern (LBP) feature descriptor to extract texture
features of coral images. Stokes and Deane [40] proposed a coral classification
algorithm based on the Discrete Cosine Transform and a K-Nearest Neighbor
classifier. Although upgraded underwater acquisition devices have improved the
quality of the data, the analysis algorithms still use SIFT and HOG to extract
features and then Support Vector Machine (SVM) [4] and Adaptive Boosting (AdaBoost)
[42] to classify them. The poor robustness of manual feature extraction has not
been effectively solved.

With the continuous development of deep learning in recent years, all aspects of
computer vision have advanced. In particular, the Convolutional Neural Network
(CNN), Fast R-CNN, and Feature Pyramid Network (FPN) algorithms, which are widely
used in image classification, image annotation, and multi-target detection, enable
people to obtain rich deep semantic information from images and significantly
improve the accuracy of image classification and recognition. Target detection and
recognition methods based on deep learning benefit from the CNN’s strong ability to
learn features autonomously from large-scale data sets, and can effectively solve
the feature extraction problem of the methods above. They have therefore been
successfully applied to many underwater target detection and recognition scenes
[18]. Kratzert and Mader [16] used a CNN-based marine fish channel monitoring
platform to detect targets without any hand-crafted features, and the final fish
classification accuracy reached 93%. Huang et al. [15] applied Faster R-CNN to
detect and identify marine organisms, expanded a small number of samples through
three data augmentation methods, and verified the effectiveness of Faster R-CNN for
biological detection in different turbulent marine environments. Xia et al. [43]
proposed a sea cucumber detection scheme based on the YOLOv2 model, which detects
sea cucumbers well when they have a regular shape or the natural scene is simple.
Although these methods have achieved some success, they are applied to specific
target scenarios and do not address target counting.


1.3 Research Content and Innovation 


As the economy and the marine environment change, marine biological resources are
of enormous value. To further improve the utilization of marine resources and the
protection of marine ecology, advanced information technology and data analysis
capabilities are needed to provide accurate data and decision support for relevant
personnel. Building on in-depth research into big data, artificial intelligence, and
related technologies, and taking existing business needs into account, this chapter
designs and implements a quantitative model for deep-sea biological identification.

The key to the success of the deep-sea biological identification and quantification
system is the extraction and application of data. Fast extraction from big data
requires a stable and reliable algorithmic basis. Data enter the data storage layer
through acquisition, extraction, conversion, cleaning, and loading. A deep-sea
biological identification and quantification database is then built with data
analysis and deep learning technologies. The computing layer provides robust image
classification and recognition, enables in-depth analysis of deep-sea biological
data, fully exploits the hidden value of the data, and supports the quantitative
recognition of deep-sea organisms. In this chapter, Faster R-CNN and SSD are adopted
for marine biological identification and quantification in different scenarios.

Considering the current state of deep-sea biological resources, and to meet the
requirements of automatic identification and quantitative analysis of deep-sea
organisms and of detecting macrobenthos on seamounts, this chapter studies and
implements the following functions:


1. A deep-sea biological recognition and quantitative analysis system is constructed
to process large numbers of deep-sea biological images, extract biological
features, and classify and count deep-sea organisms using deep learning and
other artificial intelligence technologies.

2. A seamount biological training library is constructed from the high-definition
seamount images taken by a research ship during the investigation of a seamount
in the Western Pacific Ocean. On this basis, the SSD target detection model is
trained, and the feasibility of automatic real-time detection and counting of
seamount species is studied under the trade-off between speed and accuracy.
63 high-quality images of seamount macrobenthos in the Western Pacific are
collected and manually labeled. They can be used to train various deep learning
models, which alleviates the lack of training data for marine species to a
certain extent and is helpful for other researchers in the same field.


2 The Target Detection Techniques 


2.1 Introduction on Target Detection 


In computer vision and image processing, target detection is a technique that scans
digital images or videos for specific semantic targets (such as people, buildings,
or cars) and marks them. Generally speaking, it must not only identify which
category a target belongs to but also obtain its specific position in the image.
Target detection is widely used in computer vision tasks such as automatic image
annotation, behavior recognition, face recognition, and video target segmentation.
It can also be used for target tracking, such as following the ball in a football
match or the players on the court.

Traditional target detection is usually based on classical machine learning and is
generally divided into two stages: first, methods such as SIFT and HOG are used to
extract features; then, algorithms such as SVM and AdaBoost are used for
classification. However, traditional target detection methods have two main
problems: (a) feature extraction is not targeted, and its time complexity is high;
(b) manually designed features are not robust to variations in target appearance.
Therefore, when the detection task changes, the features need to be redesigned.

In recent years, target detection algorithms based on Deep Neural Networks (DNN)
have gradually replaced traditional target detection algorithms. In computer vision
tasks, DNN-based target detection and recognition algorithms are mainly divided into
two categories. The first is the region proposal-based, or two-stage, detection
algorithm. In the first step, a series of sparse candidate regions is generated, and
in the second step, the candidate regions are further classified and regressed.
Typical representatives are R-CNN [9], Fast R-CNN [8], Faster R-CNN [33], and
Mask R-CNN [14]. Owing to their low false detection and missed detection rates,
two-stage algorithms have achieved excellent performance on several challenging
benchmarks, including Pascal VOC [6] and MS COCO [21]. The second is the
single-stage detection algorithm, which skips the generation of candidate regions
and directly predicts the class probabilities and position coordinates of targets,
so the final detection result is obtained in a single pass. Compared with the
two-stage algorithm, it therefore has a faster detection speed. Typical algorithms
include YOLOv1 [32], YOLOv2 [30], YOLOv3 [31], YOLOv4 [2], SSD (Single Shot
MultiBox Detector) [23], RetinaNet [22], RefineDet [46], and CornerNet [17]. The
advantage of single-stage algorithms is high detection efficiency, but their
detection accuracy often lags behind that of two-stage algorithms.


2.2 The Single-Stage Target Detection 


The core idea of the single-stage target detection algorithm is to take the whole
image as the input of the network and directly regress the position and category of
each bounding box (Bbox) in the output layer. The primary representatives are SSD
[23] and YOLO (You Only Look Once). In this chapter, we use SSD to detect seamount
macrobenthos because the model has a simple structure and high speed. The following
focuses on the SSD algorithm to illustrate the principle of single-stage target
detection.

SSD (Single Shot MultiBox Detector) [23] is a single-shot, single-stage detector.
It abandons the Faster R-CNN practice of using an RPN to generate bounding boxes and
then classify them, and instead introduces multi-scale features and default boxes.
Like other single-stage detectors, it is faster than two-stage detectors. SSD
combines high speed, high accuracy, and robustness to scale change. Its main
feature is the use of multi-layer convolutional features with different scales and
receptive fields for target detection and recognition.

The SSD algorithm is based on a feedforward convolutional neural network. The
algorithm first generates a fixed number of default boxes. It then uses the
corresponding feature maps of different levels to predict the location and category
relative to these default boxes. For the predicted bounding boxes of each category,
redundant and low-probability boxes are removed by the non-maximum suppression
algorithm. Finally, the detection results are generated. SSD is thus a
regression-based target detection algorithm that simultaneously predicts location
and category within a single network. Compared with the R-CNN series, SSD is a
single-stage, end-to-end target detection algorithm, and its detection speed is
greatly improved. Moreover, by design, convolutional layers of different scales are
used for target detection and recognition, which improves the detection performance
to a certain extent.
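
The non-maximum suppression step described above can be illustrated with a minimal
Python sketch. It is not the SSD authors' implementation: the function names, the
(x1, y1, x2, y2) box format, and the score and overlap thresholds (0.05 and 0.5)
are assumptions chosen only for illustration.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.05, iou_thresh=0.5):
    """Return the indices of the boxes kept for one category."""
    # Drop low-probability boxes, then visit the rest from best to worst.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Hypothetical predictions for a single category.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
kept = nms(boxes, scores)
```

In this example the second box is suppressed as a near duplicate of the first, and
the fourth is dropped for its low score, so the number of kept boxes can serve as
the count for that category.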

The SSD network is divided into a base network and extra feature layers. The base
network is a truncated VGG network. The extra layers are convolutional layers of
gradually decreasing scale, and targets are detected simultaneously on these
feature maps, with feature maps of different scales predicting targets of
different scales.

The input of SSD is a three-channel RGB image. First, the algorithm lays out a
series of default bounding boxes (default boxes) according to the size of each
feature map and then applies a set of convolution kernels. Each layer produces a
fixed number of predictions, including 4 position predictions and several category
predictions per default box. The default box mechanism is similar to the anchor box
mechanism of the Region Proposal Network (RPN) in Faster R-CNN. For a p-channel
feature map of size m x n, convolution kernels of size 3 x 3 x p are used to
predict the category and location information at each of the m x n locations. The
category prediction gives a score for each category, representing the likelihood
that a target of that category is in the corresponding box. The position prediction
gives the scaling and displacement relative to the corresponding default box, that
is, a position adjustment of the default box based on the CNN features. The default
boxes are a set of rectangles associated with each of the m x n locations, mapped
onto the original image according to the scale of the feature map level. They have
different sizes and aspect ratios to adapt to the scale variation of the targets to
be detected.

For each of the K default boxes at a position, the SSD algorithm uses convolution
to predict c + 1 category scores (c target categories and one background category)
and 4 coordinate offsets. Each position therefore needs (c + 1 + 4) x K convolution
kernels, and an m x n feature map produces (c + 1 + 4) x K x m x n prediction
outputs. Each location corresponds to a fixed number of default boxes, whose sizes
and aspect ratios depend on the location and on the scale of the layer.
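
The output count above can be checked with a few lines of Python. This is only a
worked example of the arithmetic: the function name and the sample values (a
38 x 38 feature map, 4 default boxes per location, 2 target categories) are
assumptions, not settings reported in this chapter.

```python
def ssd_outputs_per_map(m, n, k, c):
    """Count the predictions made on one m x n feature map.

    k is the number of default boxes per location and c the number of
    target categories; the background category adds one more score.
    """
    per_box = (c + 1) + 4        # category scores plus 4 box offsets
    kernels = per_box * k        # 3 x 3 x p kernels applied at every location
    return kernels, kernels * m * n


# Hypothetical layer: 38 x 38 map, 4 default boxes, 2 target categories.
kernels, outputs = ssd_outputs_per_map(m=38, n=38, k=4, c=2)
```

With these assumed values, each location needs 28 kernels and the layer produces
40432 prediction values.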

During training, the ground-truth boxes must be matched with the default boxes to
produce positive and negative samples. SSD does this by calculating the Jaccard
overlap between each default box and each ground-truth box, with a threshold of
0.5. If the Jaccard overlap between a default box and a ground-truth box is greater
than 0.5, the default box is a positive sample; otherwise, it is a negative sample.
One ground-truth box can match multiple default boxes.
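
The matching rule can be sketched as follows. This is a simplified illustration
rather than the full SSD matching strategy (for instance, it does not force every
ground-truth box to receive at least one default box). The function names and the
(x1, y1, x2, y2) box format are assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, truth_boxes, threshold=0.5):
    """Give each default box the index of its matched truth box, or -1."""
    if not truth_boxes:
        return [-1] * len(default_boxes)
    labels = []
    for d in default_boxes:
        overlaps = [jaccard(d, t) for t in truth_boxes]
        best = max(range(len(truth_boxes)), key=lambda j: overlaps[j])
        # Positive sample only if the best overlap exceeds the threshold.
        labels.append(best if overlaps[best] > threshold else -1)
    return labels


# Hypothetical boxes: two default boxes overlap the same truth box.
defaults = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
truths = [(0, 0, 10, 12)]
labels = match_default_boxes(defaults, truths)
```

Here the first two default boxes become positive samples for the same ground-truth
box, and the third becomes a negative sample.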

SSD has the following main features: 


1. It inherits from YOLO the idea of turning detection into regression, completing
target localization and classification in one pass.
2. It proposes prior boxes similar to the anchors in Faster R-CNN.
3. It adds a feature pyramid-based detection scheme, that is, it predicts targets
on feature maps with different receptive fields.


2.3 The Two-Stage Target Detection 


Among the two-stage target detection algorithms, the R-CNN series is the best
known. This chapter mainly focuses on Faster R-CNN, whose predecessors are R-CNN
and Fast R-CNN. We first briefly introduce the principles of R-CNN and Fast R-CNN
and then focus on the Faster R-CNN target detection algorithm.

Given the two problems of traditional target detection algorithms (see Sect. 2.1),
Girshick et al. proposed the R-CNN algorithm in 2014 [9]. Its principle is simple:
it determines the target's position by extracting multiple candidate regions. The
R-CNN target detection process is shown in Fig. 1.

Because applying a detector to every sliding window, as traditional algorithms do,
wastes resources, the R-CNN model uses the selective search (SS) image segmentation
algorithm [41] to extract about 1,000 to 2,000 candidate regions in a bottom-up
manner. These regions are warped to a fixed size and sent to the CNN one by one to
extract the features of each candidate region. An SVM classifier then classifies
the feature vectors extracted by the CNN. Finally, the coordinates of the
upper-left and lower-right corners of each candidate region are regressed to refine
its location, yielding the target category and its bounding box. R-CNN thus
replaces the sliding windows of traditional target detection with higher-quality
regions of interest (ROI) generated by the SS algorithm and replaces manual feature
design with a CNN. It achieved a significant breakthrough in target detection and
started the upsurge of deep learning-based target detection.


Fig. 1 The target detection process of R-CNN


Fig. 2 Fast R-CNN model structure


However, the classical R-CNN has the following problems:


1. Features must be calculated for each candidate region, so the amount of
calculation is very large.
2. The candidate regions overlap heavily, so many calculations are repeated.
3. It is not end-to-end.
4. It places a strict size requirement on the input image.


To address this, SPPNet, proposed by He et al. [12], successfully solves the
problem of repeated convolution in R-CNN. However, the problems of multi-step
training and large memory consumption remain. Therefore, Girshick proposed the
Fast R-CNN target detection algorithm in 2015 [8], and the target detection process
is shown in Fig. 2.

Fast R-CNN can feed an image of any size into the CNN and obtain its feature map by
convolution and pooling, which avoids the time-consuming R-CNN practice of running
the CNN separately on every candidate region. Like R-CNN, Fast R-CNN uses the SS
algorithm to obtain about 2,000 candidate regions and then finds the feature box
corresponding to each candidate region in the feature map. Unlike R-CNN, however,
Fast R-CNN introduces the ROI (Region of Interest) pooling operation. Its inputs
are the feature map obtained by the CNN and the candidate region boxes of different
sizes, and its output has a fixed size. The ROI pooling layer pools the region of
the feature map given by the position coordinates of each candidate region into a
fixed-size feature vector, which is then used for softmax classification and Bbox
(bounding box) regression.
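
The effect of ROI pooling, turning regions of different sizes into outputs of one
fixed size, can be sketched in plain Python. This is a simplified single-channel
illustration under assumed conventions (ROI given in feature-map cells as
end-exclusive row and column ranges, a 2 x 2 output grid), not the Fast R-CNN
implementation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool roi = (r0, c0, r1, c1) into a fixed out_h x out_w grid.

    The ROI is end-exclusive and expressed in feature-map cells.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of bin i (floor start, ceil end, at least one cell).
        rs = r0 + (i * h) // out_h
        re = max(rs + 1, r0 + ((i + 1) * h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = max(cs + 1, c0 + ((j + 1) * w + out_w - 1) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 6 feature map with values 0..23.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 6))   # a 4 x 6 region
part = roi_max_pool(fmap, (1, 2, 4, 5))    # a 3 x 3 region
```

Both regions, although of different sizes, are reduced to the same 2 x 2 shape,
which is what allows the following fully connected layers to accept them.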

Fast R-CNN abandons the multiple SVM classifiers and Bbox regressors of R-CNN and
combines classification and regression in one network using a multi-task loss
function. It trains the whole network end-to-end and outputs the target's Bbox and
category label, which improves the model's accuracy. In addition, Fast R-CNN solves
the problem of repeated feature extraction in R-CNN, so the training speed is
significantly improved.


Fig. 3 Faster R-CNN model structure


In 2016, Ren et al. [33] proposed the Faster R-CNN target detection algorithm
based on Fast R-CNN. Compared with Fast R-CNN, the key change in Faster R-CNN is
the use of an RPN (Region Proposal Network) instead of the SS segmentation
algorithm to generate candidate boxes, which significantly speeds up candidate box
generation. In addition, Faster R-CNN integrates feature extraction, candidate
region extraction, Bbox regression, and softmax classification into one network,
which significantly improves speed and accuracy. The Faster R-CNN target detection
process is shown in Fig. 3.

The network structure of Faster R-CNN is similar to that of Fast R-CNN. First, a
backbone network such as ResNet [13] or VGG16 extracts the features of the input
image. Then, the RPN obtains the offset of each candidate box relative to its
anchor box and the probability that it contains a target. Specifically, the RPN
takes the output feature map of the backbone network as input, convolves it with a
3 x 3 kernel, and then applies two 1 x 1 convolutions whose numbers of output
channels are 2 x k and 4 x k, respectively. Here, k is the number of anchors (prior
boxes) at each grid point, and the RPN uses these k anchors to make k predictions.
The 2 x k output is the target score, which gives the probability that the
predicted candidate box at each grid point does or does not contain a target. The
4 x k output is the coordinate information, which gives the offset of the predicted
candidate box at each grid point relative to the anchor box. In Faster R-CNN, k is
usually taken as 9. Finally, ROI pooling, which uses maximum pooling, produces the
pooled ROI feature map, and the result is reshaped into a 1 x n vector. Two fully
connected layers are then used for classification and regression to obtain the
target location and category.
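
The relation between k and the RPN output channels can be made concrete with a
short sketch. It assumes that the k = 9 anchors come from 3 scales combined with
3 aspect ratios (the values (8, 16, 32) and (0.5, 1, 2) used in Sect. 3.3), that
the ratio means height divided by width, and that the base size is 16 pixels. The
function name and the base size are illustrative assumptions.

```python
import math


def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1, 2)):
    """Return one (width, height) pair per scale and aspect ratio."""
    anchors = []
    for s in scales:
        area = (base * s) ** 2          # every ratio keeps the same area
        for r in ratios:                # r = height / width (assumed convention)
            w = math.sqrt(area / r)
            anchors.append((w, w * r))
    return anchors


anchors = make_anchors()
k = len(anchors)                        # anchors per grid point
cls_channels = 2 * k                    # target / no-target score per anchor
reg_channels = 4 * k                    # 4 box offsets per anchor
```

With these assumptions, k is 9, so the two 1 x 1 convolutions output 18 and 36
channels.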


2.4 Summary


This section first introduces the theory of target detection and then focuses on
the two-stage R-CNN series and the single-stage detection algorithms, especially
the Faster R-CNN and SSD algorithms, together with the advantages and disadvantages
of single-stage and two-stage algorithms. Drawing on the different characteristics
of the two kinds of algorithm, it provides the basis for the subsequent discussion
of target detection and counting models. Specifically, Sect. 3 introduces the
Detection and Quantification of Benthic Organisms (DQBO), and Sect. 4 presents the
Detection of Macrobenthos in Seamounts (DMS).


3 DQBO Based on Faster R-CNN with FPN 


3.1 Introduction on DQBO


Benthic density has always been an indispensable part of benthic target detection.
By analyzing images of dense marine benthos, we can understand the social habits of
organisms, estimate their numbers, and support applications such as aquaculture and
biotope protection. With the development of artificial intelligence and the
deepening of computer vision theory, intelligent image processing has become a
critical research area. Although CNN-based target detection algorithms are widely
used in many scenarios, the detection results do not meet all requirements and
usually require more in-depth exploration. As shown in Fig. 4, the organisms are
numerous and densely packed. Counting them manually is cumbersome and costly in
labor, so automatically counting the targets in an image is of great practical
significance.


Fig. 4 Benthic organisms density images 


3.2 The Faster R-CNN with FPN Framework for DQBO


Handling large variations in object scale is a fundamental problem in applying
target detection. Both the RPN in Faster R-CNN and Fast R-CNN rely on a single
high-level feature map, which generally performs poorly on small objects. FPN
mainly addresses the detection of small and medium-sized objects. It connects
high-level features, which have low resolution and rich semantic information, with
low-level features, which have high resolution, and this dramatically improves the
performance of small object detection.

This section embeds the FPN structure into Faster R-CNN, combining high-level and
low-level feature extraction. Without increasing the amount of calculation of the
original model, this addresses large scale variation and missed detection of small
objects. The FPN network structure is shown in Fig. 5.

In Fig. 5, ① shows the bottom-up forward propagation of the neural network. After
each convolution stage, the feature map becomes smaller and more abstract. A
pyramid level is defined for each stage of the FPN, and the output of the last
layer of each stage is selected as the reference feature map because the deepest
layer of each stage has the strongest semantic information. ② is the top-down
pathway, which upsamples the more abstract, semantically stronger higher-level
feature maps by a factor of 2 to enhance the features of the lower levels. Because
the feature maps used at each level fuse features of different resolutions and
semantic strengths, the network can detect objects at the corresponding resolutions
while ensuring that each level has an appropriate resolution and strong semantic
features. ③ is the lateral connection, which uses a 1 x 1 convolution kernel to
fuse the result of ② with the output feature map of ① without changing the size of
the feature map.


Fig. 5 The structure of FPN


ResNet: The backbone network is composed of convolution layers, and FPN is added
for feature extraction to fuse the high-resolution information of low-level
features with high-level features.

Feature Maps: After the feature extraction stage that fuses high-level and
low-level features, multi-scale prediction feature maps are obtained to adapt to
targets of different sizes.

Region Proposals: The FPN output is fed into the RPN, and the approximate locations
of the regions of interest (ROI) are framed on the feature maps, that is, the
output bounding boxes.

Classification and Regression: The highly abstracted features after multiple
convolutions are integrated to classify the target and regress its bounding box.


Fig. 6 The structure of Faster R-CNN with FPN
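
The top-down pathway and lateral connection of Fig. 5 can be sketched as one merge
step in plain Python. This is a minimal illustration under stated assumptions:
feature maps are nested lists indexed as [channel][row][column], upsampling is
nearest-neighbour by a factor of 2, and the 1 x 1 convolution weights are made-up
values. The smoothing convolution that FPN normally applies after merging is
omitted.

```python
def conv1x1(x, weight):
    """1 x 1 convolution: x is [C_in][H][W], weight is [C_out][C_in]."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wo[c] * x[c][i][j] for c in range(len(x)))
              for j in range(w)] for i in range(h)] for wo in weight]


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a [C][H][W] map."""
    return [[[row[j // 2] for j in range(2 * len(row))]
             for row in ch for _ in range(2)] for ch in x]


def fpn_merge(top, bottom_up, lateral_weight):
    """Add the upsampled higher-level map to the 1 x 1 lateral projection."""
    lateral = conv1x1(bottom_up, lateral_weight)   # match the channel count
    up = upsample2x(top)                           # match the spatial size
    return [[[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(up, lateral)]


# Hypothetical maps: a 1-channel 1 x 1 top level and a 2-channel 2 x 2 level.
top = [[[1.0]]]
bottom_up = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
merged = fpn_merge(top, bottom_up, lateral_weight=[[1.0, 0.5]])
```

The merged map keeps the 2 x 2 size of the lower level while carrying information
from the higher level, which is the property the text attributes to the fused
feature maps.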

In the detection process, the FPN structure is embedded in the feature extraction
part of Faster R-CNN. The target detection and counting model based on Faster R-CNN
consists of three parts: feature extraction, candidate region generation, and
target classification with Bbox regression. The network structure of Faster R-CNN
with FPN is shown in Fig. 6.


1. FPN: Feature extraction
To improve the recognition accuracy for organisms of different sizes in the image,
the VGG16 backbone of Faster R-CNN is replaced by ResNet50 combined with FPN.
Feature maps of different scales are obtained by the FPN and then sent to the RPN
to generate candidate regions. The fusion of deep and shallow features lets the FPN
structure effectively improve the detection rate of small targets, and the
multi-resolution feature map design gives Faster R-CNN better detection performance
for targets of different scales.

2. RPN: Get candidate region
The RPN is a fully convolutional network that can be trained end-to-end to generate
proposal bounding boxes, predicting both the boundary of an object and its
objectness score. It is a multi-task model with two outputs: a Softmax classifier
and a bounding box regressor. The core of the RPN is the anchor. Because targets
differ in size and aspect ratio, windows of multiple scales are needed. The RPN
first generates a large number of anchor boxes. After clipping and filtering,
Softmax determines whether each anchor belongs to the foreground (the box contains
an object) or the background (it does not). At the same time, the other branch
adjusts the anchor boxes to form more accurate candidate regions.

The RPN is implemented as follows: first, a small network slides over the
convolutional feature map and is fully connected to the window at each position;
the window is then mapped to a low-dimensional vector; finally, the vector is fed
into the Bbox regression layer (reg) and the Bbox classification layer (cls). The
reg layer estimates the candidate box (x, y, w, h) corresponding to each anchor.
The cls layer judges whether the candidate region is foreground or background.

3. Target classification and Bbox regression

Before target classification and bounding-box regression, a pooling operation is
needed. This layer uses the candidate regions generated by the RPN and the feature
maps of different scales generated by the backbone network to obtain fixed-size
candidate feature maps and passes them to the subsequent network, where fully
connected layers identify and locate the target. In the classification process,
Softmax is used as the classification function to assign the fixed-size feature map
produced by the ROI pooling layer to a specific category. At the same time, the
smooth L1 loss is used for bounding-box regression to refine the candidate box and
obtain the accurate position of the object. The loss function of the whole network
is shown in Eq. 1.


$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{1}$$


Here $p_i$ is the probability, calculated by Softmax, that anchor $i$ contains a
target. When the IoU between the anchor and a ground-truth box is greater than 0.7,
$p_i^*$ is 1, and when the IoU is less than 0.3, $p_i^*$ is 0. $t_i^*$ is the true
scaling value used as the regression target, including coordinate scaling and size
scaling, and $t_i$ is the scaling value predicted by the network during training.
$N_{cls}$ and $N_{reg}$ normalize the two terms, and $\lambda$ balances them.
Faster R-CNN completes the regression task by learning the scaling value. The loss
function consists of two parts: classification loss and regression loss. See Eq. 2
for the calculation of the classification loss:


$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right] \tag{2}$$

See Eq. 3 for the calculation of the regression loss:

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*) \tag{3}$$

where $R$ is the smooth L1 function, see Eq. 4:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{4}$$
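
Eqs. 1 to 4 can be written out as a short Python sketch to make the roles of the
terms explicit. It follows the standard Faster R-CNN formulation; the function
names, the normalizers `n_cls` and `n_reg`, the balancing weight `lam`, and the
sample values are assumptions for illustration and are not settings reported in
this chapter.

```python
import math


def smooth_l1(x):
    """Eq. 4: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def cls_loss(p, p_star):
    """Eq. 2: log loss for the target / no-target label p_star in {0, 1}."""
    return -math.log(p_star * p + (1 - p_star) * (1 - p))


def reg_loss(t, t_star):
    """Eq. 3: smooth L1 summed over the box parameters."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))


def total_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Eq. 1: the regression term only counts positive anchors (p_star = 1)."""
    l_cls = sum(cls_loss(pi, ps) for pi, ps in zip(p, p_star)) / n_cls
    l_reg = sum(ps * reg_loss(ti, ts)
                for ps, ti, ts in zip(p_star, t, t_star)) / n_reg
    return l_cls + lam * l_reg


# Hypothetical mini-batch of two anchors: one positive, one negative.
loss = total_loss(p=[0.5, 0.5], p_star=[1, 0],
                  t=[(0.5, 0, 0, 0), (3, 3, 3, 3)],
                  t_star=[(0, 0, 0, 0), (0, 0, 0, 0)],
                  n_cls=2, n_reg=1)
```

In this example the large regression error of the negative anchor does not
contribute, because its label $p_i^*$ is 0.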


The whole network operates as follows. First, the input image is represented as a
Height x Width x Depth tensor, which is processed by the backbone network with an
FPN structure to obtain feature maps of different scales. Then, the RPN extracts
candidate regions. Once the likely objects and their positions in the original
image are obtained, ROI pooling is applied to the backbone features and the
bounding boxes containing the objects to extract a new feature vector for each
object. These vectors are then sent to the subsequent classification and regression
network to complete target recognition and localization.


3.3 Experimental Results and Discussions 


In this experiment, the number of iterations is 120 and the batch size is 1. The
learning rate and weight decay are both set to 0.0001. The anchor scales are set to
(8, 16, 32) and the aspect ratios to (0.5, 1, 2). Gradient descent with momentum is
used for optimization, and the momentum is set to 0.9.

For the static image data, a total of 630 samples were obtained. These samples were
annotated manually with the Labeling image annotation tool and then divided into
training, validation, and test sets at a ratio of 7:2:1. The test set was used to
evaluate the model.
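
As a worked check of the split sizes, the 7:2:1 ratio applied to 630 samples can be
computed as below. The function name and the choice to give any remainder to the
training set are assumptions; the chapter does not describe how the split was
implemented.

```python
def split_counts(total, ratio=(7, 2, 1)):
    """Split total samples by ratio; any remainder goes to the first part."""
    parts = sum(ratio)
    counts = [total * r // parts for r in ratio]
    counts[0] += total - sum(counts)
    return tuple(counts)


n_train, n_val, n_test = split_counts(630)
```

This gives 441 training, 126 validation, and 63 test images, which agrees with the
441 training images mentioned in the summary of this section.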

In this chapter, Recall, Precision, and AP are used to evaluate the results of this
experiment. The results are shown in Table 1.

Precision and recall are defined in Eq. 5:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$


Here TP, TN, FP, and FN denote true positive, true negative, false positive, and
false negative recognitions, respectively, as defined in Table 2. Precision is the
proportion of predicted targets that are correct, and Recall is the proportion of
all real targets that are correctly located and recognized.


Table 1 The experimental result

Class      Recall   Precision   AP
Mussel
Shinkaia   0.876    0.755       0.756


Table 2 Confusion matrix for specified categories

Ground truth   Predicted positive    Predicted negative
Positive       True Positive (TP)    False Negative (FN)
Negative       False Positive (FP)   True Negative (TN)


Fig. 7 Marine benthos detection and quantification results (panels: Mussel,
Shinkaia). The tag 33/34 indicates that the real quantity of mussels is 34 and the
detected quantity is 33
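
The definitions in Eq. 5 translate directly into code. The sketch below uses
made-up counts purely to show the calculation; they are not results from this
experiment, and the function name is an assumption.

```python
def precision_recall(tp, fp, fn):
    """Eq. 5: precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 30 correct detections, 10 false alarms, 5 missed targets.
precision, recall = precision_recall(tp=30, fp=10, fn=5)
```

With these assumed counts, precision is 0.75 and recall is 30/35, about 0.857.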

Finally, a mean average precision (mAP) of 73.8% is obtained on the marine
biological data set. The AP of the FPN-based method is 72.0% for mussels and 75.6%
for shinkaia. The experimental results are visualized in Fig. 7. It can be seen
that Faster R-CNN with FPN is an effective counting model for static images of
marine organisms.
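
As a quick consistency check, and assuming that the mAP is the unweighted mean of
the two per-class AP values (the averaging is not stated explicitly in the text),
the reported numbers agree:

```latex
\mathrm{mAP} = \frac{AP_{\text{mussel}} + AP_{\text{shinkaia}}}{2}
             = \frac{0.720 + 0.756}{2} = 0.738
```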


3.4 Summary


This section first introduces the overall structure of the Faster R-CNN model and
describes the network structure in detail. Next, the structure of the convolutional
neural network used for feature extraction is introduced, and the network is
trained on 441 images. The experimental results show that Faster R-CNN recognition
performs well and can be applied to the quantitative analysis of marine organisms.


4 DMS Based on SSD 


4.1 Introduction on DMS 


In a narrow sense, a seamount is a submarine elevation that lies below sea level
and rises more than 1000 m. In a broad sense, sea knolls with a height of 500 to
1000 m and hills below 500 m are also called seamounts. Seamounts are significant
ecological landscapes in the deep ocean. It is estimated that seamounts account for
21% of the global seabed area [3, 45]. With their unique topography and
hydrological characteristics, unique ecosystems, abundant biodiversity, and huge
resource value, seamounts have become one of the areas of greatest interest in
deep-sea research. Compared with the surrounding deep sea, seamounts have high
productivity, high biomass, and high biodiversity.

As water depth and sediment type change, the biological communities of seamounts
show obvious biota replacement, and different sediment types often host different
biota. For example, sea pens, starfish, sea urchins, and sea cucumbers are more
common on soft sediments, while sponges, black corals, gorgonians, and sea anemones
dominate hard rock bottoms. Research on seamounts primarily focuses on the
macrobenthos, whose individuals are larger than 2 cm and can be identified in
seabed images.

With their unique biological communities, rich biodiversity, and huge resource
value, seamounts have become the focus of deep-sea biodiversity protection. At
present, the protection of marine Biodiversity Beyond National Jurisdiction (BBNJ)
is an issue of global concern. Scientific understanding of the biological
composition and distribution of seamounts is the key to the development,
utilization, and protection of this fragile deep-sea ecosystem. The Western Pacific
Ocean has the largest number and the densest distribution of seamounts in the
world, as well as the most developed trench-arc-basin system. The junction of the
Yap Trench, Mariana Trench, and Caroline Ridge is the most representative area. It
is also one of the least studied seamount areas in the world.

The greatest difficulty in studying deep-sea biodiversity lies in the acquisition 
of deep-sea specimens and data. Because of the complex topography of seamounts, 
biological sampling there is more difficult than in the deep sea in general. Of the 
more than 30,000 seamounts worldwide, only 1% have been sampled biologically, and 
only about 50 have been sampled comprehensively. As a result, seamounts remain one 
of the “least known biological habitats” for humans. Research on seamount 
biodiversity is limited to the macrobenthos; most studies focus only on species 
composition, while a few address community structure and distribution. Because 
sample acquisition is limited and sampling is insufficient, a considerable part of 
the classification and identification of seamount organisms relies on the analysis 
of videos and images of benthic organisms. Many novel organisms cannot be identified 
because of the lack of samples. 

In recent years, China has successfully carried out several seamount explorations, 
represented by the ‘Jiaolong’ HOV and the ‘FaXian’ ROV, and obtained a large amount 
of first-hand submarine image data in the South China Sea and the Western and 
Central Pacific. These explorations have significantly improved the country's 
deep-sea detection capability and provide data support for the automatic detection 
of benthos on seamounts. 


4.2 Seamount Macrobenthos Dataset 


Supported by the Strategic Leading Science and Technology Project (A) of the Chinese 
Academy of Sciences, “Material and energy exchange and its impact on the tropical 
western Pacific Ocean system”, the Institute of Oceanology, Chinese Academy of 
Sciences, has established a research system for detecting marine biodiversity on 
seamounts by building a technical platform and team. A comprehensive survey of the 
deep-sea environment, biodiversity, and ecosystem structure was carried out on three 
seamounts in the junction of the Yapu Trench, Mariana Trench, and Caroline Ridge in 
the Western Pacific Ocean (Fig. 8). More than 1000 giant and large biological 
samples were collected with the “FaXian” ROV, and more than 880 GB of in situ 
imaging data of seabed organisms were obtained. In Fig. 8, the peak of the Yapu 
seamount (Y3) is located at 8°51′N, 137°47′E; the peak of the Mariana seamount (M2) 
at 11°19′N, 139°20′E; and the peak of the Caroline seamount (M4) at 10°29′N, 
140°8′E. 

From the in situ macrobenthos images obtained in the surveys of these three 
seamounts, 63 images were labeled in Pascal VOC format with LabelImg, an image 
annotation tool. The images cover Pheronemoides fungosus Gong & Li, 2017 [10], 
Paragorgia rubra Li, Zhan & Xu, 2017 [20], Chrysogorgia ramificans Xu et al., 2019 
[44], Paraphelliactis tangi Li & Xu, 2016 [19], Poliopogon distortus Gong & Li, 2018 
[11], and Chrysogorgia binata Xu et al., 2019 [44]. All six species were discovered 
in recent years. We then checked all data manually to ensure that every image has a 
resolution of 1920 × 1080. 

Fig. 8 Seamount data acquisition locations and the “FaXian” (Discovery) ROV 


In computer vision, objects are typically annotated by drawing a rectangular 
bounding box around them in the image. In the seamount data, each bounding box is 
labeled with (Xmin, Ymin, Xmax, Ymax), where (Xmin, Ymin) and (Xmax, Ymax) are the 
two vertices on the diagonal of the rectangular label box. Each image in the 
seamount macrobenthos dataset corresponds to one XML annotation file. Finally, 
stratified random sampling is used to divide the labeled data into a training set 
and a test set at a ratio of 8:2. 
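The annotation format and the 8:2 stratified split can be illustrated with a minimal sketch. It assumes the standard Pascal VOC XML layout written by LabelImg (`size`, `object`, `name`, `bndbox`) and, as a simplification that the chapter does not state, one species label per image for stratification. The function names, the sample file content, and the random seed are hypothetical.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotation in the usual Pascal VOC layout (not taken from the dataset).
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>Paragorgia rubra</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>500</xmax><ymax>800</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (width, height, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        coords = tuple(int(float(bb.findtext(k)))
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"),) + coords)
    return width, height, boxes


def stratified_split(labels, train_ratio=0.8, seed=0):
    """Split item indices so that each class keeps roughly train_ratio in training."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_ratio)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


width, height, boxes = parse_voc(SAMPLE_XML)
# The manual resolution check described in the text, expressed as a test.
resolution_ok = (width, height) == (1920, 1080)
```

Splitting within each class, rather than over the pooled images, keeps rare species represented in both the training and the test set, which matters for a dataset this small.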


4.3 The SSD Framework for DMS 


The experimental framework of macrobenthos detection in seamounts with the SSD 
model is shown in Fig. 9. 


Backbone module: a widely used feature extraction framework, such as VGG16 or 
ResNet, composed of convolution layers and pooling layers. 

Feature extraction module 1: the feature extraction unit of the backbone. It is 
used for information extraction and is composed of convolution layers and pooling 
layers, determined by the backbone type. 

Feature extraction module 2: the feature extraction unit outside the backbone. It 
is used for high-level information extraction and is composed of one or two stacked 
convolution layers. 

Detection module: outputs the predicted confidence and the location of each prior 
box at each pixel. It is composed of a convolution layer. 

Concatenation: detections from different feature layers are concatenated. 

Non-Maximum Suppression: outputs the classifications with maximal probability and 
suppresses nearby ones that are non-maxima. It ensures that the algorithm detects 
each object only once. 


Fig. 9 The entire process of underwater species detection by SSD 
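The Non-Maximum Suppression step of Fig. 9 can be sketched in a few lines. This is a minimal greedy version under stated assumptions, not the chapter's implementation: boxes are (xmin, ymin, xmax, ymax) tuples, the overlap threshold of 0.45 is an assumed value that the chapter does not report, and in a multi-class detector the function would be applied separately to each class.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: repeatedly keep the best-scoring box and drop its close neighbours.

    Returns the indices of the kept boxes, ordered by decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two heavily overlapping detections of one object plus one separate detection.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
kept = nms(example_boxes, example_scores)
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, so only the first and third detections remain.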
SSD is based on VGG16 [38], which is pre-trained on the ILSVRC CLS-LOC dataset 
[36]. We convert FC6 (the sixth fully connected layer) and FC7 to convolutional 
layers, subsample the parameters of FC6 and FC7, and then remove all the dropout 
layers and the FC8 layer. The network is trained with the SSD weighted loss 
function [23]. We fine-tune the resulting model using SGD with an initial learning 
rate of 10⁻⁴, momentum of 0.9, weight decay of 0.0005, and a batch size of 32. The 
entire process is shown in Fig. 9. 
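As an illustration of the weighted loss mentioned above, the following sketch uses the form commonly given for SSD: a softmax confidence loss plus a smooth L1 localization loss weighted by a factor alpha, divided by the number of matched (positive) prior boxes. It is a simplified sketch rather than the authors' code. Hard negative mining is omitted, class index 0 is assumed to denote background, and alpha = 1 is an assumed default.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def ssd_loss(pred_offsets, target_offsets, class_logits, class_targets, alpha=1.0):
    """Weighted sum of confidence and localization loss over the prior boxes.

    class_targets[i] == 0 marks a background prior; only positive priors
    contribute to the localization term. Returns 0 when nothing is matched.
    """
    n_pos = sum(1 for t in class_targets if t != 0)
    if n_pos == 0:
        return 0.0
    l_conf = sum(softmax_cross_entropy(logits, t)
                 for logits, t in zip(class_logits, class_targets))
    l_loc = sum(smooth_l1(p - g)
                for po, go, t in zip(pred_offsets, target_offsets, class_targets)
                if t != 0
                for p, g in zip(po, go))
    return (l_conf + alpha * l_loc) / n_pos


# One positive prior with perfect box offsets and uninformative class scores.
example_loss = ssd_loss([(0.0, 0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0, 0.0)],
                        [[0.0, 0.0]], [1])
```

In the example, the localization term is zero and the confidence term equals log 2, the cross-entropy of a uniform two-class prediction.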


4.4 Experimental Results and Discussions 


Part of the output of our SSD model is shown in Fig. 10. Across the six marine 
species, our SSD model achieved an mAP (mean Average Precision) of 98.04% and an 
average IoU (Intersection-over-Union) above 0.8 on the test dataset with 63 images. 
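For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision for a class and the mean over classes. The chapter does not state which AP variant or IoU matching threshold was used, so the all-point interpolation here is an assumption. Each detection is assumed to be already marked as a true or false positive, for example by requiring an IoU of at least 0.5 with a previously unmatched ground-truth box of the same class.

```python
def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one class.

    detections: list of (score, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    Precision is first made monotonically non-increasing in recall.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class maps a class name to (detections, num_gt); returns the mean AP."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical detections for two classes, each with two ground-truth objects.
example = {
    "class_a": ([(0.9, True), (0.8, True)], 2),
    "class_b": ([(0.9, True), (0.8, False), (0.7, True)], 2),
}
example_map = mean_average_precision(example)
```

With only a handful of test images per species, a single missed or spurious detection shifts the AP of that class noticeably, so the per-class values are worth reporting alongside the mean.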

Although our experiment verifies that applying SSD to our marine species data is 
feasible, SSD still often fails to detect small objects. In addition, our sample 
size and the number of marine species categories are insufficient. In the future, 
we will improve SSD, in particular its ability to detect small objects in camera 
images. Our ultimate vision is to build an AI system that can identify hundreds of 
thousands of marine species in real time. 


Fig. 10 a Pheronemoides fungosus Gong & Li, 2017, b Paragorgia rubra Li, Zhan & Xu, 2017, c 
Chrysogorgia ramificans Xu et al., 2019, d Paraphelliactis tangi Li & Xu, 2016 


4.5 Summary 


This section first introduces the overall structure of the applied SSD model and 
describes the network structure in detail. Next, the convolutional neural network 
used for feature extraction is introduced, and the network is trained on the 
dataset of 63 images. The experimental results show that SSD recognizes the target 
species well and can be applied to the detection of macrobenthos on seamounts. 


5 Conclusions and Future Works 


Target detection is one of the three major tasks in computer vision. With the 
continuous development of deep learning and its wide application on land, computer 
vision algorithms are gradually being applied to underwater scenes. In this chapter, 
the application of deep learning algorithms in target detection is extended to the 
detection of marine organisms, and the detection and counting of marine organisms 
are studied in detail. First, for the detection and counting of organisms in static 
marine images, we explore a network architecture based on Faster R-CNN with FPN. 
Then, we verify the feasibility of SSD for detecting giant benthic organisms on 
seamounts. 

This chapter develops the key technologies of an artificial intelligence system for 
the quantitative analysis of marine benthos, integrates marine big data, marine 
artificial intelligence, and the marine Internet of Things, promotes the 
comprehensive development of marine artificial intelligence applications, and 
partly fills the gap in artificial intelligence applications in the deep-sea field. 

Based on the above research, our subsequent work includes the following aspects: 


1. Further expand the species richness of the dataset. Because manual tagging is 
time-consuming and labor-intensive, active learning will be used for tagging in the 
subsequent expansion of the deep-sea biology training database. 

2. Extend the static-image algorithm to dynamic video object detection and 
counting. 

3. Promote the practical deployment of the AI model and start developing a deep-sea 
macro-organism recognition and quantitative analysis system. 